ABSTRACT



Faculty effectiveness as evaluated by students was 
the focal point of the first invitational conference sponsored by the 
Measurement and Research Center of Temple University. Papers
presented cover: the rationale of student evaluation of faculty, the 
impact of student ratings on academia, the usefulness of student 
evaluations in improving college teaching, some [illegible]
model of faculty evaluation, a system for helping teachers to change 
their affective behavior through feedback, instruments for student 
evaluation of faculty, the Kansas State University Program for 
assessing and improving instructional effectiveness, student 
evaluation of instruction at Michigan State University, criteria for 
evaluation of college teaching, correlates of student ratings, the 
shortcomings of traditional approaches to faculty evaluation and 
faculty performance under stress. (HJM) 
Between April 25 and 27, 1973, the Measurement and Research Center
of Temple University held its First Invitational Conference "Faculty 
Effectiveness as Evaluated by Students." As is obvious from the title 
of the Conference, the major interest was in the student component of 
faculty evaluation. Basically, there were two reasons for this "narrow" 
approach. First, there exists a more extensive body of knowledge 
regarding evaluation of faculty by students. By default, information
on administrative and colleague ratings is limited and is rarely within 
the public domain. Second, assuming that the issue of faculty evalu- 
ation cannot be comprehensively covered or even partially resolved in 
a single conference, it was considered preferable to concentrate on 
this small component about which something is known rather than become
involved in the kinds of philosophical speculation that typically 
adhere to the broader issues. 

In planning the Conference, the decision was made to concentrate 
upon five broad areas, and speakers were gathered to cover these broad 
areas at an overview level. Recognizing that treatment at an overview 
level cannot be expected to produce anything more than overview know- 
ledge, a search was begun to find additional speakers who could present 
papers that were more specific and more practical. In deciding upon 
these additional speakers, the guiding considerations were toward
programmatic series of research studies, as well as quality and unique-
ness of approach.

With a minimum of persuasion, all speakers had complete flexibility 
in their presentations. However, the initial delineation of the subject 
into five broad areas may have been unrealistic insofar as researchers
in faculty evaluation rarely focus their efforts in areas as specific
as those outlined for the Conference. Even so, it was hoped that,
given a certain amount of overlap, the presentations would be suffi-
ciently distinct to warrant their placement within the five areas.

The results of the Conference were mildly surprising. The
distinctness of the five areas appeared to be at a minimum, and this
may have been partly due to both the state of present knowledge of
student evaluation of faculty and the impossible desire to delineate
the presentations into five distinct, but broad, areas. Examination
of the 12 papers presented in this volume suggests common themes; but, 
if compared against what one would expect from a conference designed 
to partially resolve at least some of the issues, the extent of agree-
ment among papers was relatively small. Related to this difficulty,
analysis of the taped transcriptions of the discussion following the 
presentations revealed an almost meaningless sequence of verbalizations. 
Perhaps, the mix of the participants, "lay" researchers, administrators, 
and psychometric types, compounded this problem. 

The desired order of the sessions was: General Keynote, Impact, 
Systems, Instruments, Correlates, and Discussion of Issues. However,
timing problems arose, and the following order was used: General
Keynote, Impact, Instruments, Correlates, Discussion of Issues, and 
Systems. The papers contained within this volume follow the originally 
desired order, but with one minor change. Donald Hoyt's paper was
presented under Systems, but is contained in this volume under Instruments
because its contents are more closely aligned with problems of instru-
ment construction.

The Conference was somewhat pessimistically opened with Paul
Dressel's Keynote talk. Playing the devil's advocate, Dressel pointed
out the breadth of faculty responsibility and the small role played
by classroom teaching. Raising more questions than could be answered
within the context of the Conference, one of Dressel's criticisms is
that faculty evaluation, as presently done, is dominated by an inade-
quate conception of teaching and is not very useful for the improvement
of teaching. Responses to Dressel's plaints come from the papers of
John Centra and Lawrence Aleamoni in the Impact session. If Dressel
is correct, then the impact of faculty evaluation in the domain of
teaching must be nil. Centra's presentation and discussion of the many
facets of impact stands as a more optimistic outlook, while Aleamoni's
presentation shows evidence of early research into this foundling area.

The rationale behind the session on Systems was to inquire into 
how faculty evaluation is, and can be, done from a molar systems per- 
spective, over and above the technical details of instrument construc- 
tion. Kenneth Doyle's paper presents a comprehensive description of 
many of the considerations that should go into creating a faculty 
evaluation system, while Bruce Tuckman's paper evidences the logic and
approach to developing a micro-system for changing affective responses
of faculty. A unique aspect of Tuckman's paper is its attempt to
develop a system based on theory from another field. We can see that 
Systems is a relatively new, and relatively unresearched, area begging 
for more substance in order to progress beyond the realm of instrument 
construction. 

Under instruments, the Sockloff paper attempts to delve into the 
logical considerations underlying the construction of a faculty evalu- 
ation instrument. This paper attempts to develop a skeleton for a 
model of learning and teaching for the purposes of constructing faculty 
evaluation instruments and to discuss some of the pitfalls that have 
been so carelessly neglected in the construction of instruments. Willard
Warrington's paper gives us a good idea of the history of the approach
in developing a popular instrument, while Donald Hoyt's paper takes a
more practical, not-so-psychometric, and slightly humorous view of what
is actually involved in attempting to intelligently construct an instru-
ment. Richard Perry and Reemt Baumann show us that no matter how
carefully an instrument is designed and constructed, the problems in
its usage may be insurmountable.

Last, since studies of correlates of evaluation in faculty and 
student characteristics have tended to concentrate on confounding 
factors, the results of these studies are suggestive of the varieties
of confounding factors vitiating the validities of the instruments
that were used in these studies. While Wilbert McKeachie's paper
summarizes a potpourri of results in this broad area, Jerry Gaff's
paper suggests the shortcomings of traditional approaches to faculty
evaluation. Gaff uses his results to support his views on education
and related considerations in faculty evaluation. Mary Jo Clark and
Robert Blackburn make use of a theoretical model from another field
in their study of faculty characteristics. 

Traditionally, prefaces are optimistic and sometimes laudatory 
about the contents of the prefaced volume. The break with tradition 
in this volume is meant more to encourage, rather than discourage,
future work in this area. If the concept of having students evaluate
their teachers is to be taken seriously, it is clear that some signi-
ficant improvements are needed in this field. Apparently, faculty
evaluation is a game that can be played by anyone, where demagoguery
is easily mistaken for wisdom. If this condition is the result of the
faddish nature of the area, and we know that fads live and die cyclically,
hopefully something practicable will be learned before we experience
its death and eventual revival. Although the papers contained within
this volume represent some of the best work in this area, the reader
should have little difficulty realizing that our present state of 
knowledge is rather rudimentary. 

On a more positive note, many individuals have contributed to 
the success of this first conference, and thanks and appreciation are
extended for their help. The Measurement and Research Center staff
has given large amounts of time and good advice throughout all phases
of the Conference. These individuals are:

Harold C. Reppert, Ph.D., Director 

Abraham A. Panackal, Ph.D., Director of Achievement Testing 

David D. S. Poor, Ph.D., Director of Statistical Data Analysis 

J. Porter Tuck, Director of Educational Research

Edward Lake, Data Reports 

Terry Sendrow, Information Systems 

Estelle C. Kalstein, Office Manager 

Posey Schwartz, Convention Secretary 
In addition, Millard E. Gladfelter, Chancellor of Temple University, 
entertained us as banquet speaker, allowing us a respite from the long 
involved sessions; and Earl J. McGrath, former Director of the Higher
Education Center at Temple University, willingly volunteered to moderate
the Discussion of Issues and did so with aplomb under somewhat chaotic
conditions. 

Alan L. Sockloff 
STUDENT EVALUATION OF FACULTY: WHY? WHAT? HOW?

Paul L. Dressel
Michigan State University 



I should confess in the beginning that I am not greatly enthused about
discussion of systematic evaluation of teaching by students in isolation 
from other approaches to evaluation of faculty services. I favor student 
evaluation, but I think that evaluation of faculty services is complicated 
and that too frequently ventures into the student evaluation of classroom 
teaching become simply a way of evading the broader problem of careful 
evaluation of all faculty activities. One of the common complaints about 
colleges and universities is that research is given prime consideration in 
the reward system and that little or no attention is given to teaching. 
Actually, I believe there are relatively few institutions in the country 
which systematically evaluate the research output of faculty members. I
have known many faculty members who were promoted and given salary increases
largely because of their published research, even though many of their asso-
ciates (in private) expressed doubts of its worth or quality. There are
only a few institutions that regularly collect and submit to scholars in 
other universities the research output of a person before a major promotion 
or the granting of tenure. 

Faculty members commonly engage in student advising, and there is 
general complaint from both students and administrators that faculty advising 
is grossly inadequate. Nevertheless, too little has been done to collect
systematically student appraisal of advising and even less has been done to
improve faculty advising. Yet in most institutions any attempt to provide
other systems of advising is thwarted by the insistence of the faculty that
this is their prerogative, although usually their insistence is based on a
false conviction that the student's major is the most important factor in
his undergraduate program. I doubt that the student majoring in a discipline 
must be advised by faculty members in that discipline, and I know this arrange- 
ment does not insure good advising. 

Faculty members also engage in extensive service on and off the campus. 
Some participate heavily, and even excessively, in committee work and in 
quasi-administrative work. These services are no more adequately evaluated
than teaching. Why, then, should the pressure be on teaching rather than on
the full range of faculty services when evaluation by students is discussed?
First, students often complain about teaching; hence many persons--including
most of the faculty--feel that some opportunity should be provided for students
to present their point of view. Second, the ready availability of the class-
room and large numbers of students involved in classes make it relatively
easy to use a few minutes of classroom time to collect a large number of
reactions to the course and the teacher. Third, the development of objective
formats--that is to say, a series of statements to which students can respond
by checking some alternatives--makes it possible to collect a very large amount
of data and process it speedily through use of electronic equipment. The net
result of these three considerations is that student evaluation of teaching
is undoubtedly the most prominent and the most discussed means of evaluation
of faculty services. Numbers and the pseudo-objectivity of the responses
give many people a sense of false security about the reliability and validity
of the results; yet one has only to note that, in an objective format,
students respond only to items included to realize the limitations of this
approach. My observations on many campuses indicate that many of the more 
revealing statements which ought to be in such a form are excluded by the 
faculty as irrelevant to their conception of teaching responsibilities. 

At best, most of these evaluation forms focus on what goes on in the
classroom and on what the faculty member does in the way of clarifying
objectives, making specific assignments, preparing examinations, giving
grades, and the like. This is certainly not the whole of an individual faculty
member's performance, and it does not even include all of his instructional
contributions. There have been individuals whose classroom performance was
abominable, but who have written excellent and widely used textbooks. Indi-
viduals adept at preparing tests and other evaluation materials may markedly
affect the teaching of many members of the staff, yet not excel in the 
particular kinds of behavior usually involved in student evaluation forms. 
The faculty, too, may be very effective with some students while quite 
ineffective with others. As I recall my undergraduate days as a major in 
mathematics, I reconfirm my conviction of that time that most of my under-
graduate teaching was bad, and that the mathematics teaching was deplorable.
I did have two professors who were very effective in my particular case.
Both of them, in effect, said that I was wasting time in the class and would 
profit more from independent work. One professor went so far as to guarantee
an "A" in the course whether I did anything more or not. In both cases it
was a welcome and beneficial release for me, and I really didn't lose much
time sympathizing with those students required to attend class. 

I conclude, then, from observation, experience, and some research, that 
evaluation of teaching by students is based on a very limited conception of
faculty services and, especially or particularly, on a limited conception 
of the teaching act itself. The dangers inherent in this approach are that 
this involvement in evaluation may have more read into it than it deserves 
and that the involvement in time and resources may effectively eliminate any
possibility of a broader evaluation. This last issue deserves more consider-
ation, and I shall return to it later.

Some Questions About Student Evaluation

Several different questions about student evaluation need to be 
considered. First of all, what aspect of faculty performance do we want
students to evaluate? Much of the answer depends upon how we interpret
the pronoun "we." We, as faculty, usually wish the items in any evaluation
form to be specific to the course, the content of that course, and our 
personal conception of the teaching act. Broader behavioral objectives 
definitive of a liberal education are generally rejected by the faculty. 
Usually the faculty do not want students to evaluate advising because they 
feel that advising is an extra duty thrust upon them for which there is no 
possible recognition or reward. In their advising, they are primarily 
concerned with majors and really have no interest in the broader aspects 
of advising that the undergraduate may find of great concern to him. 

Likewise, faculty reject the idea that students can evaluate the quality 
and fairness of an examination or the justification of specific course 
requirements. Administrators, accustomed to hearing students complain 
about unreasonable assignments, poor examinations, inability to hear the 
professor, professorial absenteeism, and the like, generally take a 
broader point of view of what might be evaluated by students.

Students themselves generally take a rather narrow point of view.
They are concerned that the professor express himself clearly, that his 
statements be audible, that his assignments be clear and not too demanding, 
that his examinations be directly related to classroom coverage, and that 
they neither require unreasonable memorization nor extensive thought. Students
like some clarification of objectives, but are readily satisfied with a state- 
ment of the content to be covered and the requirements to be met in terms 
of examinations, papers, and the like. They are not encouraged to think
about a course or the instruction as relevant to some of their personal
interests or their other courses. They are not urged to view the course
in terms of its contribution to a liberal or general education. Students
don't really expect that, as a result of a particular course, they will be 
increasingly capable of independent effort in the type of materials studied 
in the course. In short, we impose such limits on what students evaluate
that the student sees each course and each instructor in isolation rather
than as a part of a much broader and more significant cumulative educational 
experience. Generally, students are being asked to evaluate petty details 
which have little significance to them and often no significance to the 
instructor who might wish to use the student reactions to improve his 
teaching. For example, I submit that when students in large numbers assert 
that "objectives are not clear" instructors obtain little assistance in how 
to improve the situation. When many students say that "not much was gained
by taking this course," I know that most instructors assume that this response
is characteristic of students who get low grades, although it may as well
characterize the views of those who get "A's." I find it singularly 
unhelpful to learn whether a group of students believes an instructor was 
friendly to students. The best teacher that I ever had was distinctly not 
friendly to students, although he wasn't unfriendly or antagonistic; he was 
simply a busy man and impatient with any delay or interference. He 
obviously spent many hours of time preparing for his classes, he carefully 
read any examinations or papers, and he was deeply concerned that his 
students learn something of significance. He did know more about his students 
than most of them suspected, but he was never characterized as friendly. 

When students indicate that too much outside reading is required, one 
can scarcely judge whether this is a commendation or a criticism. Most of 
my own graduate students will respond in this manner to my two seminars
when they compare those seminars with others that they have taken. On the
other hand, they unanimously agree in a final assessment of the benefits
gained from the seminars that the reading has been valuable. Students are
frequently asked to respond to such an item as "the laboratory was a worth-
while experience." I have long since become convinced that most of the
laboratory in freshman science courses is a waste of time and money,
particularly when compared with alternative patterns of experience which
might provide greater benefits. The freshman laboratory typically does
not provide any vision of what scientific experimentation is all about; 
it's largely a cookbook and time-consuming procedure which fails miserably 
to educate the freshman student as to the nature of scientific exploration. 
Yet I agree with the faculty that most of the students are incapable of 
this judgment. Those who are would hesitate to record it in the face of 
the teacher's commitment to the laboratory.

Students are capable of evaluating much more than we permit them to
do about evaluation of faculty effort. On the whole, they evaluate what
we let them evaluate, and the faculty members tend to eliminate or ignore
any aspects of student evaluation that might materially change the prevalent 
faculty conception of teaching. 

What is good teaching? A simple answer is that good teaching produces
effective learning, but that leaves open a wide range of views as to what
constitutes good teaching. The individual who teaches mathematics as an
end in itself follows the textbook and presents to the students a series
of problem types. Generally speaking, he assumes that the students cannot
read the material in the textbook which was rewritten by a professor to impress
other professors rather than for the students. Hence, the teacher uses
classroom time to make an exposition of the theory and work a number of
problems of the same type. Ultimately, the examination samples these various
problems and permits the student to earn his grade by demonstrating that
he can indeed do what he has been asked to do from day to day. Seldom
does he understand the theory which was developed for background in 
solving the problems. He may not have the least idea of their utility. 
The likelihood is that within a few months after completing the course he
(unless he continues in mathematics) will have little recollection of 
the materials covered, less of what to do with particular problems, and 
almost no sense of the nature of mathematical reasoning and its widespread 
application in other fields. 

Good teaching in the eyes of many faculty members is simply coverage 
of particular materials demanding certain knowledge and skills and testing 
to see that these have indeed been acquired for the moment. The development 
of broader abilities, attitudes, and insights which might enable the person
to apply something of what he has learned to pursue independent study in
the field--these and other broad liberal education outcomes are ignored.
I am reminded of an individual who, by most standards, must be regarded 
as having been a very capable professor and dean who wanted help in 
evaluation of a freshman course, but rejected any attempt to state explicit
objectives on the ground that the course was a first course which prepared 
to take a second and the second prepared to take a third, etc. until 
finally, if a person took enough courses in that particular discipline, 
he might be capable of doing something with it.

In short, my major concern about the typical approach to student 
evaluation of faculty is that it is ultimately dominated by a very inadequate 
conception of teaching and learning. At best, professors present a little 
better and students temporarily learn a little more of material which has 
limited, if any, long-term significance. The usual approach, which starts
with students who know no better and works through faculty who studiously
avoid any approach which would require a broader conception of the teacher-
learning process, means that we simply reconfirm what exists. I sincerely
doubt that teaching has been very much improved on any campus by the use
of student evaluation forms. If their advent is marked by insistence that
these become available to chairmen and deans, the battle lines are clearly
drawn and ultimately the faculty will revoke that requirement. If the
evaluation is optional with the faculty members, or at least optional in
terms of their revealing it to chairmen, deans, or others, they will generally
use it only to the extent to which complimentary reactions by their students 
are passed on for whatever benefits may be accrued while other reactions
are ignored as irrelevant or as beyond the capability of student judgment. 
The Process of Student Evaluation 

The objective teacher rating form is so extensively used because of 
its convenience that other means of involving students in evaluation of 
teaching are overlooked. Any instructor, seriously concerned about his 
teaching, can learn much by careful observation of his students, by inter-
views with individuals, by classroom discussions, or by requesting essay
comments to several questions at the end of examinations. Students may be 
reluctant to express some of their concerns directly to the instructor, 
but this in itself constitutes an evaluation of great significance. The 
instructor who cannot convince his students of his ability to separate
his evaluation of student performance from student evaluation of the course 
or of his own performance has thereby identified a major deficiency. Until 
and unless he can tolerate frank discussion and criticism, he is unlikely 
to improve. 

Yet students who are, on the whole, charitable in their appraisal of
teaching may be unwilling to express their most critical concerns directly
to an instructor. They may be even less willing to do so with departmental
chairmen, instructors' colleagues, or deans. An expert interviewer,
evaluator, and observer can bring out views and behavior not readily
expressed or apparent to the teacher himself because of preoccupation with
his own activities. My own experiences with such classroom observation 
convince me that few professors can appraise the quality of a discussion 
or are even aware that, in what passes for a discussion, they may talk for
40 or 45 minutes out of 50. I have, incidentally, verified this by use of 
a stop watch! 

Some professors who reject objective check-lists and other objective 
formats are willing to use open-ended essay responses to questions or to a 
suggested list of course factors or characteristics. I rather like the
critical incident approach or a request for comment on the best and worst
aspect of a course. These do not lend themselves to generating norms. This
is an advantage, in my judgment, for if evaluation is to be focused on 
improvement, evidence that an individual teacher is above or below average 
is not only irrelevant, but it may so affect the individual that he will 
not strive to improve. If already well above average--why bother? Seek
rather for a raise or a promotion. If below average, an injured ego may 
indeed seek retribution on the students or undertake to discredit the 
entire system of evaluation. Teaching, like learning, is a very personal 
experience. Norms are no more conducive to improving teaching than to 
improving learning. 

I have visited campuses in which students are encouraged to write
letters, fill out forms, visit the dean, or in other ways present their
complaints (or commendations) about teachers. The sampling here may be of
concern to some, and the motivation of some of those using this approach
may be suspect. But the extent to which such letters are written and the
nature of the complaints registered involve some student behavior beyond
the quiescent response to a form passed out in the classroom.

There are other aspects of student performance which are relevant to 
evaluating teaching. The extent to which students elect a course or a
particular faculty member surely indicates evaluation of the worth of that
experience. If common examinations of any kind are used, either for a
course or some group of courses taught by the same individual, the examination
performance of the students is certainly an evaluation of the teaching,
although one must hasten to add that high performance on the examination 
has to be weighed in reference to the nature of the examination itself. 
Personally, I should not regard as an excellent teacher a professor whose 
students all made high grades on a very factual examination, although I 
know some faculty members who would be delighted by that evidence. Neither
would I be happy with a high level of forced performance which resulted in 
avoidance of the field thereafter. 

One aspect of student evaluation that interests me greatly and which 
is, I think, done the least is that of investigating changes in student 
behavior outside of the class and in following years. Some years ago I 
found on a college campus several groups of students in their senior 
year who were meeting bi-weekly to talk about developments in the natural 
sciences. These sessions had started spontaneously in the freshman year
because of a course required of all students as one of the general education 
group. This course dealt in part with current developments in the sciences, 
and students became aware of certain kinds of magazines and reports, and 
they banded together for meetings to read and discuss these. Several of 
these were continuing three years later. I can think of nothing more potent 
in evaluating the effectiveness of a professor than the stimulation he
provided for a group of students to continue their interests in an area
originally forcibly brought to their attention by a freshman requirement.
But in a broader sense, if we have not, in teaching a course, given an
individual some ideas, some techniques and insights or ability to do
independent study on his own in that area, we have really given him
nothing of significance. And it is generally the failure to deal with 
these broader behavioral outcomes that leaves me relatively cold to the 
usual practices in student evaluation. 

Incidentally, some institutions have undertaken evaluation of teaching 
by alumni. I have some doubts about this approach because a few years
after leaving college a student will have had such a variety of experiences 
that his recollection of contacts with specific instructors and courses as 
an undergraduate is likely to be far from accurate. Furthermore, there is
a tendency in retrospect to see one's experiences through rose-colored 
glasses and perhaps to become more charitable of professorial weaknesses 
simply because of becoming aware of the extent to which people generally 
perform less effectively than might be desirable. 
Uses and Benefits of Student Evaluation 

In this section I propose to raise the general question of why we 
should encourage student evaluation of teaching. And again the answers are 
somewhat different depending upon our interpretation of "we." Students who 
become interested in some rating and reporting on faculty, at least in my 
experience, seem to be motivated largely by two considerations. (1) They 
have had some unfortunate experiences and, in some sense, they would like 
to record somewhere their dissatisfaction. (2) They hope also that by this 
means they might warn other students to avoid certain courses or instructors. 
Beyond this, some students hope that, by the publication of reports which 
reveal the poor quality of teaching, the reward system will be brought to 
bear upon these people, forcing them to improve or leave. I have no adequate 
basis for assessing the impact of student-conducted evaluation and reporting. 
My impression, however, is that the magnitude of student interest in such 
surveys is far less than the student initiators thought would be the 
case, and I am generally convinced that the impact of these published reports 
on the faculty is minimal. Even as a visitor to campuses using this approach, 
I have been distressed by some of the statements which have been published, 
particularly about young faculty members or teaching assistants for whom 
some alternative method of calling attention to weaknesses should have 
been used rather than a published report. And, although on the whole I have 
felt students were charitable in their interpretations, the sheer inexperience 
of students in evaluation and the lack of understanding and sensitivity 
exhibited by some of the students in writing about the teaching of 
individual professors lead me to question the worth of such enterprises. 
Evaluation of teaching is a complex and difficult task. 

A second possible use of student evaluation is with reference to the 
reward of faculty members and the assignments which are given them. Students 
would like to have something to say about promotions, granting tenure, and 
possibly the granting of salary increases or other forms of recognition to 
individuals. Many of them feel, with some reason, that reports on the quality 
of teaching ought to be used to eliminate or to reward professors rather than 
simply be collected in the vain hope that individuals will be inspired to 
improve their teaching. In many respects, I agree with the students, although 
I have seen more faculty members antagonized by student reports of inadequate 
teaching than I have who were motivated to improve. Indeed, I have seen few 
departments in which a significant proportion of the staff felt any confidence 
in their ability to appraise the teaching of their associates and, considering 
the lack of adequate means of appraisal, I tend to be quite skeptical of 
departmental assessments of good teaching. For example, I recently 
visited an institution with a Doctor of Arts program under way. Members 
of the department reported on how they were going to evaluate the required 
internship of each D.A. candidate. I was given some reports from recent 
visiting committees. These tended to criticize the novices for overly 
informal classroom procedures (sitting on the edge of the table, leaning 
against the wall, etc.) as against the apparently faculty-preferred 
stance, front and center with manuscript or notes on a podium. Another 
common critical comment about the intern's teaching procedure was the 
inadequacy of the lecture, its organization, or its depth. I also noted 
criticisms of certain aspects of lectures as indicating that the intern 
was not sufficiently sensitive to the underlying facts in some of his 
statements. Out of this came the recommendation that the student be 
required to take one or more additional graduate courses so that he could 
be more precise in his treatment of these matters. I doubt that teaching 
will be much improved by this approach. If departmental faculties really 
understood good teaching, we would have less of a problem with inadequate 
teaching. As it is, a new degree may not improve the situation. The 
Ph.D. surely does not train people for teaching and, if most of our 
faculties have no conception of teaching except that of the scholar 
delivering well-organized packages of knowledge to his students, improvement 
may be difficult via a new degree. 

Quality of teaching should be a major factor in the reward system, 
but I do not believe that student ratings of teaching are an adequate 
basis for doing this, nor am I sanguine about many colleges or departments 
having a sufficient number of professors with a well-thought-out conception 
of what good undergraduate teaching is to feel sure that we can readily 
introduce any system capable of recognizing and rewarding good teaching. 
And furthermore, it is significant that, in collective bargaining, as it 
has developed in public schools and now gradually expands in higher 
education, the tendency is to avoid any approach which depends upon merit 
or good teaching. My own interpretation of this is that most faculty members 
are, first, not willing to admit they are not good teachers; second, 
they cannot admit, or will not admit, that there is sufficient agreement 
on what constitutes good teaching to pick out individuals for special 
recognition; and third, especially in colleges and universities, they hold 
that if students were carefully selected in the first place because of their 
enthusiasm for learning there would be no need for concern about good teaching. 

Another reason for student evaluation projects is found in the research 
interests of some faculty members (often psychologists working with a sophomore 
sample). I have read much research on the qualities of good teachers 
and on the effectiveness of different methods of instruction. The cumulative 
impact of all of this research essentially is nil insofar as providing any 
guidance about how to improve teaching. The generalizations are suspect 
and of little use, for improving teaching is ultimately the process of working 
with individuals. I recall being told years ago in an education course 
that the use of sarcasm by a teacher was quite undesirable. I was immediately 
led to think about a number of professors whose gentle use of sarcasm 
needled students to think more deeply about an issue. This is only a simple 
example of someone's attempt to devise (by rationalization or research) a 
general and apparently reasonable principle of very limited validity. Prof. 
McKeachie argues that there are some general statements which can be made 
about the effectiveness of various methods of instruction. But with all 
deference to my good friend, I continue to doubt that we know anything 
about the relationship of any generalized method to specific outcomes. 
In the first place, I have grave doubts about studies which characterize 
relationships between methods and outcomes. In most cases, when I 
have looked at them closely, I have found that the so-called methods were 
not clearly defined or consistently used, and were often contaminated by 
other factors. Earlier I mentioned a case of a teacher who had talked 
for 45 out of 50 minutes. At the close of the class he remarked to me 
that this was the best discussion session he had had in some time. In 
another study, I found the majority of a control group regularly meeting 
with the experimental group because they found the latter's 
experiences were more exciting than their own. 

A second problem is that the differences in method should be 
related to the objectives that the professor has in mind. I find few 
professors deeply concerned about objectives involving personal development, 
affect, values, or even the development of increasing independence 
and self-direction. In a study last year, I found one professor giving 
a lecture three times a week to a student enrolled in independent study. 
Yet both the professor and the records characterized the student's 
experience as independent study. 

I do not object to research on the nature of teaching and learning. 
In fact, we need much more fundamental research than we have, but I would 
point out that research and evaluation are very different things. Research, 
in the long run, may provide us some insights from which we can move toward 
improvement; but the concerns of students and of critics of higher education 
are that we do something about improving teaching right now. This 
is evaluation. 

Certainly from the point of view just mentioned, and probably 
from the point of view of this conference, improvement of instruction and 
of learning represent the two major concerns which justify evaluation by 
students. We need to note that in this process of improvement of instruction 
there are some problems which, in effect, negate improvement. 
Evidence will not improve instruction if that evidence is also used to 
deny a salary increase, negate a promotion, or decide a tenure action. 
Somehow, these administrative decisions and the process of improvement 
have to be separated from each other. For new instructors joining a 
staff, emphasis can well be on improvement of instruction with enough 
time lapse so that if two or three years later it becomes clear that 
the individual will not or cannot improve then appropriate action can 
be taken. If data on the quality of teaching become 
available only at a point in time when a decision is to be made, then 
most faculty members will only resist, fight, and attempt to deny 
the validity of any undesirable information which accrues. 

Any attempt to relate evaluation to the improvement of instruction 
and also to decisions about individuals will generate real difficulties. 
As has been true in so many cases, the attempt to develop an evaluation 
scheme involving student response generates a faculty demand that this 
be handled as a confidential feedback to individuals who may or may not 
see fit to share the results with others. This leads to a pattern of 
optional reporting or consultation in which individuals utilize only 
so much of an evaluation as they find suitable to their purpose. I 
have in a few cases learned of at least a temporary situation in which 
reports were placed with a department chairman and the faculty member 
was asked to sit down with the chairman for a formal discussion of 
the student ratings. I would have a great deal more confidence in this 
if I felt that most department chairmen were sensitive to what good 
teaching involves. Required consultation would be helpful if it could 
be used as the starting point of a program gauged to the needs of the 
individual professor and if it could help him, over time, improve the 
quality of his teaching and finally culminate in another reporting which 
would demonstrate that improvement. Those universities which have been 
able to set up a program of services to help professors analyze their 
course, course objectives and materials, to develop new materials, make 
use of educational technology, and other means to improve the quality 
of the learning of students have, I am convinced, done a great deal to 
improve the quality of teaching and learning. The difficulties I see 
here are twofold. First, there are not large numbers of professors 
who take advantage of these services and, if more were encouraged to 
do so, the costs might readily become prohibitive. Second, observation 
over a period of time indicates to me that individuals who become deeply 
concerned about their teaching and take advantage of all of these 
possibilities tend shortly to become involved in other activities. They 
may become administrators, they may become involved in committee work, 
sometimes they become consultants on these matters, and end up by 
retreating to a lower quality of instruction simply because they become 
so much engrossed with other matters or have moved to new assignments as 
a result of the venture into improvement of teaching and learning. 

I would make another remark about encouraging faculty to look 
at their teaching. At the present time, when recommendations are made on 
faculty members, we usually lack the information required to determine 
whether a person is a good teacher or not. The individual, backed by 
his fellow faculty members, insists that if there is no evidence to 
demonstrate that he is an inadequate teacher then we must assume that 
he is a good teacher. And so we do. We could change the situation by 
informing everyone who joined the faculty as an instructor or assistant 
professor that he would not be promoted or given tenure until he provided 
convincing information about the outstanding quality of his teaching. 
In short, throw the burden back on the individual and then make available 
to him the help and the services to gain that information. I have not 
yet seen any institution that was willing to take this approach and, as 
collective bargaining becomes more widely prevalent, it may become 
impossible. 

Another use or benefit of student rating is to alleviate student 
concerns and perhaps develop some good will by giving students an 
opportunity to participate in faculty appraisal. In some institutions 
students actually sit on committees with faculty in passing judgments 
on promotions, salary increases, and tenure. At this point, I am sure 
the student voice must have some impact. I doubt, however, that the 
usual student evaluation has any impact on the departmental recommendations 
with regard to individuals. Thus, in a sense, we gull the 
students into believing that their voice is heard, but actually ignore 
it, except that student appraisal of teaching does at least tend to 
promote faculty awareness of student reactions. 
Possible Detrimental Effects of Student Evaluation 

In accordance with my attempt to analyze the benefits of student 
evaluation, I should also consider the possible detrimental effects. 
One major point that I have already made is that the usual approach to 
student evaluation involves much too limited a conception of teaching. 
This limited conception of teaching has a two-way impact. On one hand, 
it allows the student to continue to think that teaching can be evaluated 
primarily on what goes on in a classroom situation. My own commitment 
is that teaching is more properly evaluated by the inspiration which it 
gives to the student to carry on his learning beyond the classroom 
situation. A second and related concern is that student evaluation, 
in the usual pattern, deals in generalities which have little to do 
with good teaching. The opportunity of a student to react to a statement 
that the objectives of the course are not clear may say something 
about the statement of objectives either in the departmentally-prepared 
syllabus or in the lecture as prepared by the instructor. In either 
case, it says very little about the appropriateness of these objectives 
or the extent to which the objectives carry beyond the specific content 
covered to the development of general abilities, insights, and values. 
The indication that the laboratory was or was not a worthwhile experience 
tells, at best, from a rather limited student point of view, whether 
the laboratory experience seemed worthwhile. The student has no basis 
for determining whether the laboratory was as effective as some other 
experience might have been, and he certainly has no basis for weighing 
the costs of the laboratory against possible demonstrations of some 
of the ideas conveyed through the laboratory experience. Such statements 
as these have very little directly to do with good teaching, and 
they provide no information which can be used as a basis for improvement. 
Most students, given the statement that the instructor did or did 
not synthesize, integrate, or summarize, will respond to this in 
unsatisfactory or meaningless ways. If the instructor regularly, at 
the end of each class, attempts to summarize what he has covered, the 
students will probably recognize this. Nevertheless, that attempt to 
synthesize, integrate, or summarize may be grossly inadequate in terms 
of the immediate material and even less adequate in terms of the long- 
term development of concepts and principles in the course. In short, 
the fact that the instructor is noted as summarizing does not at all 
mean that he summarized well. Earlier we noted also that the ease 
with which student evaluation on a mechanical basis can be carried out 
makes this a very popular approach. At the same time, the involvement 
of time and energy in this approach becomes an excuse for not going any 
further in the evaluation of teaching and learning. In a sense, the 
detrimental effect here is that we yield to criticism by doing something, 
but choose to do something that is inadequate to correct the 
real problem which generates the concern. Putting it another way, 
we react as little as possible. 

There is a need for balance in evaluation, and balance must be 
interpreted to include many things. The adequacy of the classroom 
situation itself needs to be evaluated. If too hot, too crowded, or 
too noisy, attention and learning will suffer. The objectives need 
to be examined in some depth. Many courses, especially in colleges 
and universities, have no formal statements of objectives, but simply 
assume that the materials covered are objectives in themselves. The 
objective is to cover the material without thinking through or really 
being concerned about the results in terms of new insights and abilities 
on the part of the individual student. The student is examined 
on how much of the material he has memorized. When objectives are 
unclear or inadequate, evaluation concentrates on the process. But 
improvement of the process is impossible unless based upon improvement 
in learning with regard to objectives. If these are regarded as 
inadequate by qualified observers, improvement is not possible. What 
the faculty member does (which is a part of the process of education 
that goes on in the classroom) and what the faculty member expects 
or requires of his students outside of the classroom also should be 
related to objectives. The culminating aspect of evaluation is always 
with regard to learning by the students. What have they achieved? 
And at this point, evaluation must not focus simply on what they have 
achieved in terms of the originally stated objectives, but also in 
terms of other by-products, side issues which may not have been 
contemplated. A student may have indeed met certain required knowledge 
goals in a course, but come out disliking the area so much that he vows 
never to have further contact with that field. In this case, the 
significance of the unanticipated outcome completely negates the actual 
gains made with regard to this specified outcome. 
Cost Benefit Analysis 

At this point, I shall undertake to draw together several strands 
of thought to deal with the general question, does student evaluation 
achieve benefits in proportion to the costs in time, energy, morale, and 
dollars? I should note that many student evaluation programs require 
the use of a class period or part of a class period. What is intended 
to be 10 or 15 minutes for a response to a form often, by student 
contrivance, extends to 30 or 40 minutes. Even if the student is asked 
to take the form home to respond to it (at the risk of reducing the 
response total and polluting the response by discussion with roommates), 
some class time is usually required for passing out forms and explanation. 
But generally speaking, the amount of committee and administrative 
time involved in the preparation of a student rating form is the most 
expensive aspect of the whole process. My own experience indicates 
that faculty members are likely to insist that any evaluation form be 
thoroughly reviewed by a local committee, which probably means several 
tryouts, an extensive amount of work by some staff members, and a great 
deal of editorial work and elimination by the committee. The instrument 
coming out of a university committee usually is, by faculty insistence, 
circulated to departments for reactions, with the result that many 
of the more significant items (at least in my estimation) have been 
eliminated as irrelevant. No sooner is the instrument given than there 
are faculty criticisms and a demand for elimination of certain items 
and for review and revision of others. Typically, we have found at 
Michigan State that some faculty resist and will refuse to use any 
instrument in which statements are made to which students are expected 
to react. They demand an open-ended form with a series of questions 
to which students write an essay response. Their insistence that no 
objective format is meaningful in providing specifics for improvement 
is one with which I sympathize. I cannot avoid noting also that, in 
providing an essay response, the student almost totally negates any 
attempt to summarize student reactions in the form of norms. 

In addition to these costs in time (which are seldom estimated), 
there are cash outlays for printing, scoring, and compiling norms. 
There are further staff time involvements in the many consultations 
with individuals within departments, with various committees, and the 
like. In any large university, I am quite sure that any careful 
assessment of the costs of student teacher rating forms would put them 
at a magnitude of $5,000 or $10,000 per year. And in those years 
(probably every two or three) in which a major revision is required, 
the total costs, including all of the time of the many persons who 
become involved, may well run to $40,000 or $50,000. The question, 
then, that one has to weigh is whether the expenditure 
of funds in this way is justified in terms of the benefits gained. 
If I were to summarize the benefits of student ratings as I have seen 
them operate at Michigan State and other institutions where I have 
consulted, it would be as follows. First, the involvement of students 
in rating faculty is evidence of concern about the quality of teaching. 
Second, administrative support of such student ratings and financial 
support for the total process indicates an administrative position which 
favorably influences the student, although it may be rejected by the 
faculty as an invasion of academic freedom and of departmental and college 
privileges. Third, the extensive discussions that are involved on any 
campus when the matter of student rating of teaching is under consideration 
probably have some educational value. Members of the committees and 
others who become involved are led to think through some of the characteristics 
of good teaching, and this may have an influence on some of them 
transcending any direct benefits which come from the use of the forms 
which ultimately result. It would be very difficult to assess each of 
these educational benefits. My own observation leads me to believe that 
the discussions at the formative stage of such a program may be the most 
valuable result of the whole venture. Fourth, the development of a 
student rating project may affect hiring and reward criteria. I 
underline "may" because, in those situations where I have had any 
chance to observe, my conviction is that the lapse in time and the 
almost complete separation between programs of student rating and procedures 
for selecting new faculty make it very unlikely that there is anything 
more than the most general consciousness about teaching which carries 
over from the evaluation program to the selection of faculty. It has 
probably happened, but I have yet to learn of a faculty member who was 
asked to present student ratings on his teaching in applying for a 
position elsewhere. 

My tentative conclusions from this review of student rating of teaching 
are the following: 

1. The usual faculty and student conceptions of the nature, 
objectives, and obligations of teaching and learning (bound by traditions 
and limited by experience and bias) simply do not provide an adequate 
basis for student evaluation of teaching. 

2. Unless based upon a conception of objectives and of teacher 
obligations beyond the traditional classroom, the impact of student 
evaluation is very limited. It may indeed be more of a distraction than 
a benefit. 

3. Student evaluation alone, whether by structured inventory or 
other means, is obviously not an adequate basis for judging total faculty 
effectiveness. It is also inadequate for assessing teaching effectiveness. 
Hence, unless balanced by other evidence, reliance on student evaluation 
may be both inequitable and dangerous. 

4. Published student evaluations are not very useful to faculty members, 
are probably used by a relative minority of students, and they may be grossly 
unfair to junior members of a faculty whose careers are still in a formative 
stage and who should be receiving concrete positive help in improving their 
teaching rather than published criticisms made by naive individuals whose own 
conception of teaching, formed as it has been by their college experience, 
is grossly inadequate. 

5. Finally, this paper has emphasized that there are other forms of 
student evaluation and rating scales, and that there are many other aspects 
of evaluation of faculty services which have some relationship to teaching. 
My own conviction, then, is that, in any institution in which there is 
concern about faculty performance, those involved in developing an 
evaluation program should think through in the broadest terms the obligations 
and activities of faculty and attempt to develop a complete evaluation 
system. After this has been done, several different ventures may be 
developed in terms of evaluation of aspects of faculty performance. I'm 
certain this will result in a realization that there are more facets and 
more interrelationships among these than student ratings can possibly 
provide. I believe that our approach to defining and collecting student 
ratings of teaching will be redefined if related to a broader concern about 
what faculty do and how well they do it. 



THE STUDENT AS GODFATHER? 
THE IMPACT OF STUDENT RATINGS ON ACADEMIA 
John A. Centra 
Educational Testing Service 

Most of you, I'm sure, are familiar with the Godfather role made 
popular by the very successful book and movie. He was depicted as 
someone with a great deal of power over people and viewed by most with 
a mixture of awe, fear, and respect. In fact, his "offers that one 
could not refuse" were indeed, as some of you will recall, quite 
compelling. 

There are some who fear that the college student, by virtue of the 
apparent increasing emphasis on student ratings of professors, could 
become the "Godfather" of the academic community. More exactly, they 
fear that too much emphasis could be put on these ratings and that, 
generally speaking, the power that students might acquire would not be 
in the best interest of the academic community. 

These Cassandras can, in fact, point to the medieval universities 
as an example of unreasonable student influence over teachers. As 
Hastings Rashdall tells us in his writings about the medieval European 
universities, students at the University of Bologna not only paid 
teachers a "collecta" or fee (which apparently was determined by a 
teacher's ability to haggle), but they also could report teacher irregularities 
to the rector. For example, law texts were divided into segments, 
and each instructor was required to cover a particular segment by a 
specified date; to enforce this statute, the rector appointed a committee 
of students to report on dilatory professors, who were then required to 
pay a fine for each day that they had fallen behind. 
While few people would take seriously the possibility that students 
are on the verge of assuming the role they played in medieval days, some 
do question the ultimate impact of student evaluations on teaching and 
learning. I will be more specific about some of their reservations later 
in this paper. In addition, I plan to discuss evidence of the positive 
effects of student ratings, and finally, since the impact of student 
ratings on certain aspects of academic life is not totally known, I will 
speculate about some possible consequences. 

I've grouped my comments within five categories and will discuss 
the impact or possible impact of student ratings on the individual 
instructor, on teaching generally, on students, on administrators, and 
on the college. 
The Individual Instructor 

First, let me begin by discussing the person the ratings are meant 
to influence most: the individual teacher. There has been a good deal 
of skepticism over how much effect the ratings actually have on changing 
or improving instruction, particularly when the results are seen only by 
the individual teacher. Faculty conservatism, when it comes to educational 
changes, has been a well-known tendency, although there are signs that it 
may be less true now than in the past. For example, I recently had 
occasion to look at the responses of some 2800 college teachers to the 
question, "When did you last make changes in the teaching methods you 
are using?" About a fourth indicated that they had never made changes. 
On the other hand, about half said that they had changed their methods 
during the past two years. So it looks as if we should not indict all 
college teachers with the time-worn stereotypes of stodginess and traditionalism. 
Many apparently are willing to change their methods. 
The question, though, is what causes teachers to change and, more 
germane to my topic, can ratings by students lead to any noticeable 
changes among college teachers? While a few investigators have noted 
that the ratings that teachers receive seem to improve over time, we 
know that we cannot assume a cause and effect relationship. Those 
changes could have been caused by any number of factors other than the 
initial student feedback. 

One of the best ways to investigate the effects of student ratings 
on an instructor's practices is to employ an experimental design in 
which random groups of teachers receive feedback from students while 
other teachers (those in the control groups) do not. As some of you 
know, I completed such a study within the past year with the cooperation 
of over 400 faculty members at five colleges. The details of that 
study are presented elsewhere (Centra, 1972), so I won't take the time 
to repeat them. But I would like to discuss briefly the results. The 
major conclusions of the study were, first, that changes in instruction 
(as assessed by repeated student ratings) occurred after only a half 
semester for instructors whose self-evaluations were considerably 
better than were their student ratings. If, in other words, teachers 
were especially "unrealistic" in how they viewed their teaching 
(unrealistic relative to their students' views, that is), then they 
tended to make some changes in their instructional practices, even 
though they had only a half semester to do so. I might add that such 
variables as the subject area of the course, sex of the instructor, and 
number of years the instructor had taught did not distinguish which 
instructors made changes; or to put it another way, none of the subgroups 
of teachers formed by these variables were more likely to change. The
second conclusion was that a wider variety of instructors changed if given
more than a half-semester of time and if they had some minimal information
to help them interpret their scores. Let's consider briefly the implications 
of each of these findings. 

Starting with the first result, why do you suppose changes in teaching 
procedures were related to the discrepancy between self-evaluations and 
student ratings? Actually this result was predicted at the outset of the 
study because there was fairly good reason to expect it, based on social 
psychological theory. As a matter of fact there are several similar theories 
that help explain the finding. Most are referred to as self-consistency 
or equilibrium theories, the central notion being that an individual's 
actions are strongly influenced by his desire to maintain a consistent 
cognitive condition with respect to his evaluations of himself. What this 
means is that when student ratings are much poorer than an instructor's 
self-ratings, a condition of imbalance (Heider, 1958), dissonance (Festinger,
1957), or incongruency (Newcomb, 1961; Secord & Backman, 1965) is created
in the instructor. In an attempt to become more consistent, or in more 
theoretical terms to restore a condition of equilibrium, the instructor 
changes in the direction indicated by his students' ratings. 

These theories assume, of course, that most instructors place enough 
value on collective student opinion, and that instructors know how to go 
about making changes. Undoubtedly some teachers merely write off student 
judgment as unreliable or unworthy, and for these individuals, changes are 
unlikely even though they may be called for. At least the changes are
unlikely if the only motivation comes from within the individual teacher.
Increasingly, however, student ratings of professors are becoming public
information, and in these instances there is undoubtedly a good deal of
social pressure to change. In fact, not only is there social pressure,
but in some instances there is economic pressure, since the ratings may
be used in salary and tenure deliberations. But as I've said, it is not
always clear to the teacher how to change, if indeed he or she believes
the change would be an improvement. And this leads me to the implications
of the second finding from my five-college study.

I mentioned that with additional time and with some interpretative
information, the ratings for a more diverse group of teachers had changed
in a positive direction. Not surprisingly, many teachers need more time
to change their procedures, particularly in those areas that cannot be
quickly altered (clarifying course objectives, for example). Yet if
student ratings are to have maximum impact, I believe we need to do more
in interpreting the results to instructors and in helping them improve.

One of the reasons that we need to help instructors interpret their
ratings is that the ratings are typically skewed in a positive direction.
Most of us already know this, but the average teacher does not. On a
five-point scale, he views his mean score of 3.6 as above average, when
actually it may well be only average or even below average if compared to
other teachers. Parenthetically, I might add that instructor self-ratings,
not surprisingly, are skewed even more positively than student ratings.
And faculty peer ratings based on classroom visits, according to some
data I've recently collected, are also generally more favorable than
student ratings. In any event, some kind of normative or comparative data
is important for interpreting student ratings, and, perhaps, the more the
better. The instructor might be given the choice of comparing his students'
responses to those of other teachers at his institution, or to those of
members of his department; or perhaps he may prefer a more cosmopolitan
comparison--such as to instructors from a sample of other institutions,
or perhaps to a national sample of teachers in his field. The point is
that a variety of comparisons might be made available to the instructor
so that he can decide which are most meaningful.
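
The point about positively skewed ratings can be illustrated with a small sketch. It is only an illustration under stated assumptions: the comparison-group means below are invented, and the function name `percentile_rank` is hypothetical, not part of any rating system described in this talk. It shows how a mean of 3.6 on a five-point scale can sit well below the middle of a comparison group.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_scores):
    """Percent of comparison-group means below `score` (ties count half)."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

# Hypothetical comparison groups (invented numbers, skewed toward the
# upper end of a five-point scale, as the text describes).
norm_groups = {
    "own department": [3.4, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3],
    "own institution": [3.1, 3.3, 3.5, 3.6, 3.8, 3.9, 4.0, 4.1, 4.2, 4.4],
}

instructor_mean = 3.6
for group, means in norm_groups.items():
    rank = percentile_rank(instructor_mean, means)
    print(f"{group}: percentile rank {rank:.1f}")
```

Offering several comparison groups in one table, as above, lets the instructor choose the comparison he finds most meaningful.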

Some of these comparison data are already being made available to 
instructors, though not always with the variety I've suggested. But I'm 
afraid that they do not totally solve the problem. There will still be
some instructors who need special help, and for this reason Kenneth Eble
(1971), for one, has suggested that individual instructional counseling
be made freely available. A teacher counselor might not only help 
instructors interpret their student evaluations but could, of course, 
also suggest particular ways in which to improve. A few institutions 
are already doing this, but in these times of tight money this will 
probably remain a limited endeavor. 

I'd like therefore to mention another possibility that I'm now 
pursuing. In place of an individual counselor I would propose substituting 
the next best thing: the computer. One of the remarkable feats of the
computer is that it can be programmed to produce a verbal interpretation
of a numerical summary. Rather than means, standard deviations, or 
percentile ranks, each professor could instead get several paragraphs of 
prose telling him how he differs from his own expectations and how he 
differs from some predesignated group, such as other teachers in his field.
The number-leery professor need not worry about whether his scores are
significantly different--the computer will make that interpretation.
Moreover it would even be possible to refer the instructor to specific 
materials, books, or even video tapes pertinent to his weaknesses. For 
example, if students said his course objectives were not made clear, or
if they rated the quality of exams poorly, there would be several excellent
references dealing with these topics suggested to the instructor. In
fact, there is really no need to rely on the computer to produce these
suggestions--we ought to be doing that sort of thing right now.
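
A minimal sketch of what such a computer-written interpretation might look like follows. Everything in it is an assumption made for illustration: the function name `interpret`, the 0.5-point discrepancy threshold, and the 30/70 percent cutoffs are hypothetical and are not taken from the five-college study, and the sketch makes no claim about statistical significance.

```python
def interpret(item, self_rating, student_mean, norm_means, gap=0.5):
    """Turn one item's numbers into a short prose interpretation.

    `gap` is a hypothetical threshold for calling the difference between
    self-rating and student mean noteworthy.
    """
    sentences = []

    # Compare the instructor's own expectation with the student mean.
    diff = self_rating - student_mean
    if diff >= gap:
        sentences.append(
            f"On '{item}', you rated yourself noticeably higher than your students did."
        )
    elif diff <= -gap:
        sentences.append(
            f"On '{item}', you rated yourself noticeably lower than your students did."
        )
    else:
        sentences.append(
            f"On '{item}', your self-rating was close to your students' rating."
        )

    # Compare the student mean with a predesignated comparison group.
    share_below = sum(1 for m in norm_means if m < student_mean) / len(norm_means)
    if share_below < 0.3:
        sentences.append("Your student rating is below most of the comparison group.")
    elif share_below > 0.7:
        sentences.append("Your student rating is above most of the comparison group.")
    else:
        sentences.append("Your student rating is near the middle of the comparison group.")

    return " ".join(sentences)


# Hypothetical numbers for a single item.
print(interpret("course objectives were made clear", 4.5, 3.2,
                [3.0, 3.5, 3.8, 4.0, 4.2]))
```

A fuller version could attach a list of suggested readings to each item, supplied by whoever runs the rating program.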

Before moving on to discussing other categories, I'd like to make
one last point regarding the effects of student ratings on the individual
teacher. With the emphasis generally put on mean scores or percentile
ranks of scores, I'm afraid that the individual teacher is being influenced
to see his class only as a homogeneous glob. Anyone who has taught knows
that quite frequently there are several types of students in the typical
class, each of which may be reacting a little differently to the teacher
and the course. These different types and their various viewpoints do
not mean that the ratings are unreliable in the sense that there is a
great deal of fluctuation or inconsistency in student responses. We know
that student ratings are reliable, as indicated by the numerous intraclass
reliability studies that have been reported. What I'm talking about is
identifying subgroups of students who differ systematically in their
ratings. Is there, in short, some rhyme or reason to the diversity of
viewpoints that may exist in the typical class?

One way to investigate this question is to use factor analytic 
techniques that allow one to group individuals rather than items as is 
usually the case (see Tucker & Messick, 1963). The only study I have
found that looked at this question had investigated students' general
notions about types of teachers rather than their specific ratings of
individual teachers (Rees, 1969). So I've undertaken some additional
analyses--first with three large classes separately and then across a
larger sample of courses--which indicate that there are frequently three
or sometimes four points of view represented in a single class. Each
of these groups sees various aspects of the course or the instruction they
are receiving somewhat differently from the other groups. One group, for
example, may have rated the instructor as generally ineffective, but at
the same time indicated that the instructor was well organized and usually
accessible; another group might have rated the instructor as ineffective
and inaccessible. Unfortunately, I don't at this point have enough
information about student characteristics that would allow me to describe 
the groups. Ultimately, however, it may be possible to alert the individual 
teacher to relevant subgroups or points of view in the class; these points 
of view might be identified by student characteristics information, or
they might be identified by patterns of ratings. Until then, teachers
should be encouraged to look at the distribution of student responses to
the items on their rating form--and not only at the mean scores. While no
one expects them to please all of their students all of the time, instructors
ought to be aware of how they interact with different segments of the class. 
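
The advice to look at the distribution and not only at the mean can be shown with a short sketch. The two sets of responses are invented for illustration, and the helper name `response_distribution` is hypothetical. Both classes have the same mean on a five-point item, yet one is unanimous and the other is split into two opposed groups.

```python
from collections import Counter
from statistics import mean

def response_distribution(responses, scale=(1, 2, 3, 4, 5)):
    """Count how many students chose each point on the rating scale."""
    counts = Counter(responses)
    return {point: counts.get(point, 0) for point in scale}

# Invented responses to one item from two classes of eight students.
class_a = [3, 3, 3, 3, 3, 3, 3, 3]   # everyone in the middle
class_b = [1, 1, 1, 1, 5, 5, 5, 5]   # two opposed subgroups

for name, responses in (("class A", class_a), ("class B", class_b)):
    print(name, "mean:", mean(responses),
          "distribution:", response_distribution(responses))
```

A report that prints only the mean would treat these two classes as identical.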
Impact on Teaching Generally 

Closely related to the effects of student ratings on the individual 
teacher is the possible impact that they have on teaching generally. The
critics of student ratings claim that an undue emphasis on the ratings,
such as using them to assist in decisions on faculty promotions, can have
adverse effects on instruction. What are some of these adverse effects?
First, some critics claim that the ratings do not allow for individual
styles of teaching, that they instead force everyone to be measured on the
same yardstick. Few people would try to assess artists or composers on the
same yardstick, according to one skeptic of student ratings. That skeptic
goes on to say, in an article in The American Scholar, that:

    The art critic need not evaluate portraits painted by
    Picasso, Whistler, and Rembrandt in terms of criteria
    for effectiveness common to all three. He finds it
    possible to examine each artist's work in terms of the
    artist's own goals, or to identify the strengths and
    weaknesses of an individual painting in terms of
    relations of parts to the whole [Kossoff, 1972, p. 89].

Even though I don't happen to believe that teaching and art are
entirely comparable, we know enough about teaching to know that individuals
can have quite different styles, and that they should probably develop 
the style that best fits their personality and approach. I'll return to 
this point in a minute. 

A second adverse effect of student ratings, according to the same 
critics, is that they encourage traditional modes of teaching. Most rating 
forms are indeed directed at classes taught in some combination of
lecture-discussion, but logically so--that happens to be the way most courses have
been taught and the forms are merely reflecting what is typically the case.
The question is, however, are other methods such as student-centered
learning, or nondirective teaching, or team teaching being stifled by the
typical student rating forms? The answer, in my opinion, is that they are
if an institution does not allow some flexibility in the application of
student ratings. This means that for some courses, and this is still a
relatively small number on most campuses I suspect, it is necessary 
either to supplement or disregard items in the traditional rating forms. 

Flexibility in the employment of student ratings is, in other words,
extremely critical. Many of the widely used forms have been developed
through what might be called the consensus approach. In other words the
developers have asked samples of faculty members (or faculty members and
students) to identify specific characteristics that are important in teaching.
Those areas or items for which there was the greatest consensus were then
included in the rating instrument. Generally speaking, the items have
centered around such factors as course organization, teacher-student
interaction, and communication or verbal fluency. It's clear that this
approach does not produce an instrument that reflects any particular theory
of teaching. And that probably has made good sense in view of the fact 
that it would be difficult to get any college faculty to agree on a single 
theory of teaching. 

While most forms allow individual instructors to add their own items 
to a basic set, there are other ways in which the rating forms can be even
more flexible. If the items are to be used in making decisions on faculty 
members, then the individual teacher might be allowed to eliminate those 
items that are not relevant to his style. Better yet, a system might be 
implemented which allows teachers to both choose and weigh in advance 
the items which they feel most adequately reflect their style of teaching 
and what they are trying to accomplish in the course. At least one 
institution is now working on such an approach. 
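
One way to picture a system in which teachers choose and weigh items in advance is a simple weighted mean. The sketch below is an assumption for illustration only; the talk does not say how the institution mentioned above actually combines items, and the item names, weights, and function name `weighted_score` are all hypothetical.

```python
def weighted_score(item_means, weights):
    """Weighted mean over the items an instructor chose in advance.

    Items missing from `weights` are treated as not relevant to the
    instructor's style and are left out of the score.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one item must carry a positive weight")
    return sum(item_means[item] * w for item, w in weights.items()) / total_weight

# Invented item means from one course.
item_means = {"organization": 4.2, "discussion": 3.0, "lecture clarity": 3.8}

# A discussion-centered instructor drops the lecture item and gives
# discussion three times the weight of organization.
weights = {"organization": 1, "discussion": 3}

print(round(weighted_score(item_means, weights), 2))
```

Fixing the weights before the ratings are collected keeps the choice from being made with the results in hand.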
Impact on Administrators 

Another group that student ratings influence--albeit more indirectly
than previous groups--is college administrators. I have two observations
to offer regarding this. First, in instances where the ratings are
used in making decisions on promotions, it may well be that the dean or
department chairman's job becomes a little easier.

National surveys have told us that frequently the judgments of one 
or more administrators are relied on to assess teaching effectiveness, 
particularly at smaller colleges. Not many people would defend this as 
a very wise or valid approach. If we can assume that the evidence provided 
by student evaluations means not only wiser decisions but also ones that
are more easily defended, then students' evaluations make the administrators'
jobs easier and more effective. Some, I realize, would debate that point.

A second observation that I have is that student evaluations may well 
be contributing to what seems to be a current groundswell for administrator 
evaluations by faculty members. A not too infrequent request to ETS is 
for an instrument to evaluate administrator performance. Apparently the
feeling is that if faculty can be evaluated by their constituents, then by 
all means so can administrators. Increasingly, it would appear that they 
are. For example, the trustees of the State University of New York 
announced in January that the presidents of the 29 colleges operated by 
the state will have to undergo intensive evaluation of their records every 
five years. But I'm not at all sure that a handy-dandy machine-scored
instrument could be developed that would measure reliably and validly an
administrator's performance. More likely the charge is for administrator
accountability (to use the still-currently "in" word), in which an individual
is accountable not only to his superiors but also to his subordinates. 
Impact on Students 

According to the results of the ACE 1972 annual survey of freshmen, 
students feel generally that faculty promotions ought to be based in part 
on student ratings. That opinion was endorsed by three-quarters of the
students from the 373 institutions in the survey. This probably comes as
no surprise. The past decade has, of course, been a time when students
have demanded a greater role in institutional decision-making, and the
evaluation of teaching would appear to be an area in which they feel they
can make a unique contribution. Where student ratings have been incorporated
into faculty evaluation procedures, therefore, the impact on students is
likely to be quite positive; at least each of them can feel that he or she
is helping the institution make important educational decisions. This is
not to be taken lightly. While in the past teachers and administrators
have been willing to give students a say in such areas as the establishment
of student personnel policies and regulations, they've been more reluctant 
to relinquish their hold on academic decision-making. 

Aside from this, probably the major impact of student ratings on 
students is provided by published course and teacher critiques. While 
some institutions make public the results of college-sponsored student 
evaluations (and some publish course guides based on detailed descriptions 
provided by the instructor), most of the critiques are based on surveys
that are student initiated and conducted. As you might suspect, these
student-produced critiques vary considerably in quality from one institution
to another; in fact, they may vary from year to year at single institutions,
depending on which students get involved. The worst of the critiques
have been based on poor samples and frequently border on sensationalism by 
highlighting the juiciest of criticisms. Needless to say these critiques 
do neither the teachers nor the students who purchase them much good. But 
what about the better publications; what about the critiques based on 
thorough methodology and which, as in some instances, also give the teacher 
an opportunity to respond to his student evaluations? Do they have a 
suitable reason for being? One might argue that they provide information
that the college catalog or other publications don't provide and this would
seem to be a valid purpose. Nevertheless there are many faculty members
who object strongly to student-conducted course ratings. Their objections
have been delineated by Kerlinger in a 1971 article in School and Society.
He argues that student initiated ratings result in "instructor hostility,
resentment, and distrust," and thus alienate faculty members from their
work. He goes on to suggest that ratings are legitimate only if conducted 
voluntarily by professors and used for self- improvement. Obviously then, 
not only is there concern for who initiates and conducts a student rating 
of instruction program, but also to what end the results are to be used. 

Needed, it seems to me, is a major study of the effects of student 
ratings when they are used to assist in deciding whom to promote. There 
are a number of questions that such a study might investigate. For 
example, to what extent do faculty become alienated? Which types become 
most alienated? Does it encourage traditional teaching and limit teaching 
styles, as already discussed? Does it erroneously reinforce the notion 
in students that the instructor is largely responsible for how much students 
learn in a course? This last point may be true regardless of how student 
rating results are used and in spite of the fact that many of the rating 
forms ask students about their own effort and involvement in the course. 
But the major question to be answered by such a study is whether more 
defensible promotion decisions are made when student evaluations are 
included as part of faculty assessment. 
Impact on the College 

The last category that I will comment on is the impact, or possible
impact, of student ratings on the college. 

I've already discussed changes that take place among individual 
teachers--or at least among some teachers. But can an institution, or
perhaps the departments within an institution, learn something about
themselves from student evaluations? A corollary question is: "What can the
institution or department then do about what they've learned?" 

Let's start at the department level. A seldom mentioned, though
seemingly worthwhile, use of student ratings is that of providing
departments with information about the effectiveness of their offerings as seen
by students. To do this it would be necessary to combine the ratings of
all members in a department, and items dealing with specific as well as
general course objectives should be included in the assessment. In
addition to these course-instructor evaluations, a sort of major field
questionnaire might be given to seniors. Princeton University, for one,
has been using a major field or department questionnaire for the past
several years. While not the typical application of student evaluations,
the assessment of departmental offerings would seem to be worthy of 
consideration by other institutions. 

Another point that might be made concerning the departments is that,
as many of us have discovered, there are some interesting variations in
the evaluations that teachers in different subject fields receive. Among
a group of some 450 teachers, for example, I found that courses in the 
natural sciences, relative to those in humanities, social sciences, and 
education and applied subjects, were seen by students as having a faster 
pace, as being more difficult, and as being less likely to stimulate 
student interest. In addition, teachers perceived the natural science 
teachers in the sample as less open to other viewpoints. Humanities 
teachers, in comparison to those in the other three general subject areas, 
were less likely to inform students of how they were to be evaluated, and
there was less agreement between the announced objectives of humanities
courses and what was actually taught.

The obvious question is whether it is the subject matter itself that 
produces these differences or the types of individuals within each of the 
subject areas. It may well be a combination of both. At any rate, patterns
of ratings would indicate that subject fields or departments might focus
on certain apparent weaknesses (for example, humanities professors might
attend workshops on improving their evaluation procedures).

The whole notion of focusing on weaknesses highlighted by student 
evaluations could be applied at the college level even more generally. 
If a college is able to compare itself to other colleges--that is, if the
aggregate ratings of all teachers can be compared--then it may be possible
to identify specific weaknesses. Workshops in that particular aspect of 
instruction might then be offered to assist in faculty improvement. 

Conclusion 

In this paper I've attempted to discuss the effects or possible
effects of student evaluations on academia. It has been apparent
throughout the discussion that the major effects are, to a large extent, dependent
upon how the ratings are used. Their primary uses can perhaps be summarized
best by adapting Michael Scriven's (1967) terms for the two major functions
of tests: formative and summative evaluation. Tests used formatively,
according to Scriven, give the instructor periodic feedback on his students'
progress, thus telling the instructor what needs to be stressed in the 
future. The summative function of tests, as the term implies, is a way 
of providing a summative evaluation of each student at some point in time. 

When student ratings of instruction are used formatively--that is,
when they are used by instructors as a source of feedback on their
teaching--the evidence indicates that some changes are made by the instructor.
And most likely we can improve on this with better interpretation of the
results. The effects of using student ratings in a summative way--that
is, in making administrative decisions on faculty--are a little more difficult
to assess. As a researcher I feel we ought to learn more about the
side-effects. But if I were a department chairman or dean faced with
increasingly tougher tenure-promotion decisions, or if I were a faculty
member who felt that his teaching was not being rewarded, then I might
hold a different view. Certainly student evaluations are no less
trustworthy than other methods now available to assess teaching performance,
and when combined with other methods, they probably contribute to a fair
judgment. 

In closing, I'd like to return briefly to the title of this talk. 
As you have realized by this time, I don't believe that students, through
student ratings, are or will become the Mario Puzo type of Godfather to
the academic community. But this is not to say that they might not
function in a limited way as proper Godfathers. Traditionally, of course,
a Godfather has had a much more positive image; he essentially is one who
helps provide guidance and direction to those in his charge. While I'm
not suggesting that students are the new saviors of academia, or that
college teachers must rely on the guidance of their students, I do think
that a well-designed student ratings program can do more to benefit than
to harm the academic community.

THE USEFULNESS OF STUDENT EVALUATIONS IN
IMPROVING COLLEGE TEACHING
Lawrence M. Aleamoni
University of Illinois 

In the past few years as a result of the 1970 student strikes and 
the emphasis on accountability, course and instructor evaluation has 
been placed in the spotlight. In an attempt to build a total
instructional evaluation system, a great deal of emphasis has been placed on
student evaluations of course and instructor. In order for student
evaluations to be considered an integral part of a total instructional
evaluation system, they must be both reliable and valid.

Of the various systems developed for student evaluation of course 
and instructor, the Illinois Course Evaluation Questionnaire (CEQ)
has perhaps the most extensive reliability and validity data to support
it as well as the most extensive norm data base. Norm data have been
collected continuously since 1966 at the University of Illinois,
Urbana-Champaign campus. The CEQ is used to collect student attitudes towards
a course and instructor and its purpose is to enable faculty members
to collect evaluative information about their teaching. Once the
instructor has used the CEQ and submitted the forms for analysis, two
copies of the results are returned only to the instructor. As the
number of measures on each course is increased, it becomes possible to 
obtain a relatively stable indication of the difference between courses. 
This aids in the interpretation of the actual differences between an 
obtained section score for a particular instructor and the average 
scores for all the sections represented in that course.

The analysis of item inter-relationships and the subscore
inter-relationships indicated that no one element, related to a course,
disproportionately influenced the students' evaluation of the course
(Spencer & Aleamoni, 1969). It appears that there is a "general
course attitude" cultivated by the student as he is exposed to previous
students' comments, the instructor, the textbook, the course, etc., and
this is the framework from which he responds when answering the CEQ 
items. 

It would seem, on the basis of three validity studies (Stallings
& Spencer, 1967; Swanson & Sisson, 1971; Aleamoni & Yimer, 1972), the
face validity of the CEQ, and its high reliability, that extremely low
scores on a particular subscore should indicate problem areas in an
instructor's teaching procedure, whereas stable high scores should
point to an effective instructional program as viewed by students.
All available validating evidence (both published and unpublished 
studies), to date, indicates that the CEQ does indeed identify 
courses that are considered to be excellent or poor. 

After using the CEQ, the instructor receives results (see Appendix 
A) which allow him to compare his course item means to institutional
course item means (via deciles) and his course subscale means to norm
subscale means categorized by (a) rank of instructor, (b) level of
course, (c) institution, (d) college, and (e) all institutions that have 
used the CEQ throughout the United States. The subscale results allow 
the instructor to obtain an indication of major areas of strengths and 
weaknesses in the course. Once the areas of weakness have been identified 
by the subscales, then looking at the item results helps to focus on the 
more specific problem areas. The CEQ items are not completely diagnostic but
do serve to elicit diagnostic responses from the instructor teaching the
course. It provides a means whereby some evaluation of the teaching
process can occur; other means can be arranged and are available such
as asking more diagnostic questions in the optional item section
available on the CEQ form, or having peers sit in on actual class sessions,
etc. It is important to recognize, however, that student opinions are
in existence and do affect learning--and they do provide a source of
quite reliable and valid data relative to the effectiveness of
instruction (Costin, Greenough & Menges, 1971).

In order to provide instructors with items that may be more 
relevant or diagnostic for their particular courses, a catalog of items
was generated by the Measurement and Research Division of the Offices
of Instructional Resources at the University of Illinois, Urbana-Champaign
campus. The items were gathered from all existing sources such as
institutional, national, departmental, and individual instructor
questionnaires. They were then restated so that the response categories
of strongly agree (SA), agree (A), disagree (D), and strongly disagree
(SD) would apply. This then made it possible for those items to be
used in the "Optional Item" section of the CEQ (see Appendix B).

This collection of some 270 items was divided into 19 categories
consisting of: (a) instructor contribution, (b) attitude toward students,
(c) student outcomes, (d) relevance of course, (e) use of class time,
(f) organization and presentation, (g) clarity of presentation, (h)
instructor characteristics, (i) interest of presentation, (j) expectations
and objectives, (k) behavioral indications of course attitude,
(l) general attitude toward instructor, (m) speed and depth of coverage,
(n) out-of-class, (o) examinations, (p) visual aids, (q) grading,
(r) assignments, and (s) laboratory and recitation.

The response to the availability of the catalog of optional items
was gratifying in that it was not finished until December 12, 1972, less
than four weeks before the end of the fall semester. Of 1414 course
sections using the CEQ during fall semester 1972, approximately 313 made
use of the optional item section. 

After the instructor has decided to use the CEQ and/or any
optional items of his choice, it is then up to him to decide what to do
with the data. If he feels that the interpretation manual (Aleamoni,
1972) and abbreviated interpretation sheets are not sufficient to
help him identify areas that may need improvement in the course, he
can then arrange for a conference with one of the members of the
Measurement and Research Division staff. Such a conference would begin
with a close scrutiny of the CEQ subscale results to see if any
problem existed based on the norm data available. If a problem area
was identified (such as Method of Instruction) then a close look at
the items making up that subscale would be in order. If, in the
discussion with the instructor, the source of difficulty is identified,
then the discussion would shift to possible ways of trying to resolve
the difficulty. If, on the other hand, the source of difficulty cannot
be identified using the existing items and the instructor's recall,
then procedures (such as the use of optional items that are much more
diagnostic) would be explored to be able to identify the specific
problem. 

It has been through a process such as this that instructors have
been able to use student evaluations to identify instructional problems
and then rectify them. Obviously, the success or failure of such a
venture rests solely with the instructor and his willingness to both
gather and use the data provided him.

A question that naturally arises from the above considerations is, 
"Can student evaluations of instruction and instructor be useful in 
improving college teaching once they are made available to the instructor?" 
Although there has been a great deal of anecdotal evidence to suggest 
that such evaluations do have a positive effect, no studies to date 
were available to support that "evidence." Since the author has been 
involved in utilizing student evaluations to help instructors identify 
and diagnose instructional problems, the data was available to conduct 
the present study. 

Method 

Instructors at two different institutions (University of Arizona
at Tucson and Sheridan College at Sheridan, Wyoming) who had used the
CEQ during the fall, 1971 and spring, 1972 terms for their courses
were the subjects of the present study. Each of these instructors
was then scheduled to talk with the author about his/her results. The
conferences were conducted individually at the home campus of the
instructor and took approximately 15 to 20 minutes. The conference
began with a close scrutiny of the CEQ subscale results to see if any
problems existed based on the norm data available. If a problem area
was identified (such as Method of Instruction) then a close look at
the items making up that subscale would be in order. If, in the
discussion with the instructor, the source of difficulty was identified,
then the discussion shifted to possible ways of trying to resolve the
difficulty. If, on the other hand, the source of difficulty was not
identified using the existing items and the instructor's recall, then
procedures (such as the use of optional items that are much more
diagnostic) were explored to be able to identify the specific problem.

In order to attempt to answer the question of usefulness of
student evaluations in improving college teaching, each instructor
who had participated in the individual conferences was subsequently
followed up to see if any significant change had occurred in their
student ratings in subsequent terms in the same or continuous courses.
Similar CEQ data for instructors who were not able to participate
in the individual conferences was available to use as a control group
measure.

Means, standard deviations, class sizes, and norm deciles were
obtained for each of the above instructors on five of the CEQ
subscales as well as the Total. That data (presented in Table 1)
was then analyzed to determine if the conferences had any significant
effect in helping the instructor improve his/her teaching as
reflected in subsequent student evaluations measured by the subscales
and Total score of the CEQ.

Insert Table 1 about here 

Results 

In looking at the norm decile changes that took place for the
lowest subscale value discussed in the conference (see Table 2), it
appears that the conferences did have a significant effect, especially
when compared to the control group norm decile changes. The average
norm decile increase for the experimental group as observed in Table 2
is 3.94 compared to .57 for the control group. It varies slightly
for each of the two institutions. The range of norm decile increase
for the experimental group is from 2 to 8 compared to from -2 to 3 for
the control group.
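
The two overall averages can be checked against the institution-level
means reported in Table 2. The sketch below is illustrative only and is
not part of the original analysis. The group sizes are assumptions
inferred from the rows of Table 2 and from the reported means (4
experimental and 5 control instructors at Arizona, 12 experimental and 2
control instructors at Sheridan), and the function name pooled_mean is
hypothetical.

```python
def pooled_mean(group_means, group_sizes):
    """Average of group means, weighting each mean by its group size."""
    weighted_total = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_total / sum(group_sizes)


# Mean norm decile increase per institution (Arizona, Sheridan), from Table 2.
experimental_means = [2.75, 4.33]
control_means = [0.00, 2.0]

# ASSUMED group sizes (Arizona, Sheridan); inferred, not stated in the text.
experimental_sizes = [4, 12]
control_sizes = [5, 2]

experimental_overall = pooled_mean(experimental_means, experimental_sizes)
control_overall = pooled_mean(control_means, control_sizes)

print(experimental_overall)  # compare with the reported 3.94
print(control_overall)       # compare with the reported .57
```

Under these assumed group sizes the pooled values agree with the reported
3.94 and .57 to within rounding of the institution means.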



Insert Table 2 about here 



References 

Aleamoni, L. M. Results and Interpretation Manual: Illinois Course
Evaluation Questionnaire. Research Report No. 331. Urbana, Illinois:
Measurement and Research Division, Office of Instructional Resources,
University of Illinois at Urbana-Champaign, 1972.

Aleamoni, L. M. & Yimer, M. An investigation of the relationship
between colleague rating, student rating, research productivity, and
academic rank in rating instructional effectiveness. Research Report
No. 338. Urbana, Illinois: Measurement and Research Division, Office
of Instructional Resources, University of Illinois at Urbana-Champaign,
1972.

Costin, F., Greenough, W. T. & Menges, R. J. Student ratings of college
teaching: Reliability, validity, and usefulness. Review of Educational
Research, 1971, 41(5), 511-535.

Spencer, R. E. & Aleamoni, L. M. The Illinois Course Evaluation
Questionnaire: A description of its development and a report of
some of its results. Research Report No. 292. Urbana, Illinois:
Measurement and Research Division, Office of Instructional Resources,
University of Illinois at Urbana-Champaign, 1969.

Stallings, W. M. & Spencer, R. E. Ratings of instructors in Accountancy
101 from video-tape clips. Research Report No. 265. Urbana, Illinois:
Measurement and Research Division, Office of Instructional Resources,
University of Illinois at Urbana-Champaign, 1967.

Swanson, R. A. & Sisson, D. J. The development, evaluation, and
utilization of a departmental faculty appraisal system. Journal of
Industrial Teacher Education, 1971, 9(1), 64-79.



[Table 1: means, standard deviations, and norm deciles, pre and post, for
the experimental and control instructors at Arizona and Sheridan. The
table entries are not legible in this copy.]



Table 2

Norm Decile Changes for the Lowest Subscale Value
Discussed in the Individual Conferences

                      Experimental                  Control
Institution     Pre    Post   Increase/       Pre    Post   Increase/
                              Decrease                      Decrease

Arizona
     1           2      6        4             5      4       -1
     2           5      8        3             9      9        0
     3           4      6        2             9      7       -2
     4           5      7        2             0      2        2
     5                                         3      4        1
   Mean        4.00   6.75     2.75           5.2    5.2      .00

Sheridan
    1-8        [entries not legible in this copy]
     9           2      4        2
    10           4      9        5
    11           1      6        5
    12           3      8        5
   Mean        2.42   6.75     4.33           4.5    6.5      2.0
Appendix A

RESULTS FOR THE OBJECTIVE ITEMS ON THE ADVISOR QUESTIONNAIRE

EDPSY 490   SECTION H   ENROL=0005   FALL 1971   03620J

SEX
FEMALE   MALE    OMIT
 0.20    0.20    0.60

MAJOR-MINOR
MAJOR    MINOR   OTHER   OMIT
 0.?0    0.20    0.00

COURSE OPTION
REQ      ELECT   OMIT
0.40     0.40    0.20

PASS-FAIL
YES      NO      OMIT
0.00     0.60    0.40

STATUS
FRESH    SOPH    JR      SR      GRAD    OTHER   OMIT
0.00     0.00    0.00    0.00    1.00    0.00    0.00

EXPECTED GRADE
A        B       C       D       E       OMIT
0.60     0.40    0.00    0.00    0.00    0.00

COURSE GRADE
A        B       C       D       E       OMIT
0.80     0.20    0.00    0.00    0.00    0.00

INSTRUCTOR GRADE
A        B       C       D       E       OMIT
1.00     0.00    0.00    0.00    0.00    0.00

ITEM    SA      A       D       SD      OMIT    BEST    MEAN    S.D.    DEC
  1.    0.00    0.00    0.40    0.?0    0.00    SD      3.80    0.55     9
  2.    0.00    0.00    0.40    0.60    0.00    SD      3.60    0.55     7
  3.    0.60    0.40    0.00    0.00    0.00    SA      3.60    0.55     9
  4.    0.40    0.60    0.00    0.00    0.00    SA      3.40    0.55     9
  5.    0.80    0.20    0.00    0.00    0.00    SA      3.80    0.45     8
  6.    0.60    0.40    0.00    0.00    0.00    SA      3.60    0.55     9
  7.    0.80    0.20    0.00    0.00    0.00    SA      3.80    0.45     9
  8.    0.00    0.20    0.40    0.40    0.00    SD      3.20    0.84     8
  9.    0.20    0.80    0.00    0.00    0.00    SA      3.20    0.45     8
 10.    0.00    0.00    0.40    0.60    0.00    SD      3.60    0.58     9
 11.    0.00    0.00    0.?0    0.60    0.00    SD      3.80    0.45     9
 12.    0.20    0.80    0.00    0.00    0.00    SA      3.20    0.45     7
 13.    0.60    0.40    0.00    0.00    0.00    SA      3.60    0.55     9
 14.    0.00    0.00    0.40    0.60    0.00    SD      3.60    0.55     9
 15.    0.00    0.00    0.20    0.80    0.00    SD      3.80    0.45     9
 16.    0.80    0.20    0.00    0.00    0.00    SA      3.80    0.45     9

HERMAC - TEST ANALYSIS AND QUESTIONNAIRE PACKAGE

ITEM    SA      A       D       SD      OMIT    BEST    MEAN
 17.    0.00    0.00    0.20    0.80    0.00    SD      3.80
 18.    1.00    0.00    0.00    0.00    0.00    SA      4.00
 19.    0.20    0.80    0.00    0.00    0.00    SA      3.20
 20.    0.60    0.40    0.00    0.00    0.00    SA      3.60
 21.    0.40    0.40    0.20    0.00    0.00    SA      3.20
 22.    0.40    0.60    0.00    0.00    0.00    SA      3.40
 23.    0.20    0.00    0.20    0.60    0.00    SD      3.20
 24.    0.00    0.00    0.20    0.80    0.00    SD      3.80
 25.    0.40    0.60    0.00    0.00    0.00    SA      3.40
 26.    0.00    0.00    1.00    0.00    0.00    SD      3.00
 27.    0.60    0.40    0.00    0.00    0.00    SA      3.60
 28.    0.00    0.20    0.80    0.00    0.00    SD      2.80
 29.    0.00    0.00    0.00    1.00    0.00    SD      4.00
 30.    0.00    0.80    0.20    0.00    0.00    SA      2.80
 31.    0.00    0.00    0.00    1.00    0.00    SD      4.00
 32.    0.00    0.40    0.60    0.00    0.00    SD      2.60
 33.    0.00    0.00    0.40    0.60    0.00    SD      3.60
 34.    0.00    0.00    0.20    0.80    0.00    SD      3.80
 35.    0.40    0.60    0.00    0.00    0.00    SA      3.40
 36.    0.60    0.40    0.00    0.00    0.00    SA      3.60
 37.    0.00    0.00    0.40    0.60    0.00    SD      3.60
 38.    0.00    0.00    0.60    0.40    0.00    SD      3.40
 39.    0.00    0.40    0.60    0.00    0.00    SD      2.60
 40.    0.40    0.60    0.00    0.00    0.00    SA      3.40
 41.    0.00    0.20    0.60    0.20    0.00    SD      3.00
 42.    1.00    0.00    0.00    0.00    0.00    SA      4.00
 43.    0.00    0.00    0.80    0.20    0.00    SD      3.20
 44.    0.00    0.20    0.40    0.40    0.00    SD      3.20
 45.    0.00    0.00    0.60    0.40    0.00    SD      3.40
 46.    0.00    0.00    0.20    0.80    0.00    SD      3.80
 47.    0.80    0.20    0.00    0.00    0.00    SA      3.80
 48.    0.00    0.00    0.40    0.60    0.00    SD      3.60
 49.    0.40    0.60    0.00    0.00    0.00    SA
 50.    0.60    0.40    0.00    0.00    0.00    SA

-- SUBSCORE --
GENERAL ATTITUDE, METHOD, CONTENT, INTEREST, INSTRUCTOR, SPECIFIC ITEMS,
TOTAL [subscore values not legible in this copy]

FACULTY EVALUATION: SOME CONSIDERATIONS AND A MODEL

Kenneth 0. Doyle, Jr. 
University of Minnesota 

Some months ago I happened to be having dinner with a fellow from the
governor's staff "in one of our great midwestern states." The topic of
universities came up in our conversation, particularly the topics of
accountability and faculty evaluation. I was describing some of the problems
involved in developing systems of faculty evaluation when he cut me off:
There's nothing to it, he snapped; you simply assign monies to departments
on the basis of their contribution to the gross national product!

I'm not going to tell you what happened after that--just that it was
not one of the most enjoyable meals I've experienced! His comment scared
the daylights out of me, though, and underscored the importance of developing
our own internal systems of evaluation before something less meaningful--and
less palatable--is imposed on us.

With this added motivation I went into the literature with hopes of
finding systems of evaluation that our institution might try on for size.
I talked with faculty and students and administrators from various schools.
What I found--with a few encouraging exceptions--was that faculty evaluation
is a chaotic enterprise, as technically, politically, and conceptually
complex as even the most masochistic of us could hope to enjoy.

Since I'm a bit of a compulsive sort, I needed to try to make order
out of this chaos. Let me share with you what I've done thus far.

Considerations Concerning Faculty Evaluation

I believe there are a number of considerations that obtain for any
system of faculty evaluation. We need to think about the purpose of the
evaluation, the focus and consequence of the evaluation, sources of
measurable data and the quality of those data. We need to attend to the
goals of the institution. And we need to consider the media for gathering
and reporting data and the temporal dimension along which the data must be
gathered and interpreted. I'd like to develop each of these considerations
a bit, then tie them together into the beginnings of a conceptual schema, and
finally show some applications of that schema. Although everything I say
should pertain to all of faculty evaluation--advising, research, governance,
and service as well as teaching--I'll draw most of my examples from the
evaluation of teaching.

Purposes of Evaluation

There seem to be three more or less distinct and commonly proposed
reasons for undertaking an evaluation: (1) to help improve faculty
performance, (2) to help make personnel decisions concerning faculty, and
(3) to provide a criterion measure for various kinds of educational research.
Another purpose exists exclusively for the evaluation of teaching, namely
to provide information that could help students choose their courses. Since
I think that any criterion measure we might want to provide for research
can come from purposes (1) or (2), I'll limit my remarks to the other
purposes: to improve performance, to help in personnel decisions, and,
for teaching only, to counsel students. Let's look at each in more detail.

Evaluation to improve faculty performance, which seems to be the most
frequently stated purpose for doing evaluation, is distinguished from the
other kinds of evaluation in that it attempts to diagnose strong and weak
points in faculty behavior with the intent of helping remedy the weaknesses
and reinforce the strengths. I want to emphasize that when I say "faculty
performance" I'm not talking exclusively about teaching; I'm talking about
the evaluation of all aspects of professional behavior--advising, research,
governance, and public service, as well as teaching. Nor am I restricting
our information to student data; I'm including data from colleagues,
administrators, the public, and the faculty member himself.

Evaluations for personnel decisions focus on the rank, pay, and tenure
determinations that lie near the heart of the student ratings controversy.
These are evaluation data that help in the selection of faculty from a pool
of applicants, in the placement of existing staff according to their abilities
and attitudes (not just their interest and availability), and in the retention
and promotion (or demotion) of faculty as a consequence of their professional
performance. Again we need to remember that the sources of data are many and
the behaviors to be evaluated varied.

People sometimes seem to make too clear-cut a distinction between
these two purposes of evaluation. In theory, such a differentiation is
sound, and it leads to some pointed considerations about, for instance,
levels of reliability and validity that need to be established for the
different uses of the data, and about techniques for gathering and analyzing
information. (E.g., typical forced-choice scales are more suitable for
personnel decisions than for improving performance because these scales don't
usually furnish diagnostic or formative information.) But in practice the
distinction breaks down to some extent. For example, although we might
claim that the reliability of a particular instrument permits its use "only"
for improving teaching, we have no way to restrict the use of the data once
they are out of our hands. (Eventually I would hope that these two purposes
will become even less distinct, that data to improve teaching and data for
personnel decisions will overlap considerably more than they do now.)

The third purpose for evaluation seems to pertain only to teaching:
evaluation for the purpose of counseling students. These evaluations are
intended to provide information that students might use to select among
available classes or instructors--or, for that matter, institutions.
This kind of information is by nature public and might be made available
directly to students in bookstores, in student unions, in departmental
or college offices, and so forth, or access to it can be restricted to
certain professionals--advisors and counselors, for instance. Other
modes of access have been suggested. Some students at our institution
have suggested a university telephone number at which a student operator
could read the information to callers--rather a large responsibility for
the operator and rather a busy operator at some times of the year! And a
group of unusually imaginative students has been considering a system of
computer terminals (CRT's) strategically located around the campus, which
students could use to call up course-selection information from central
data storage pools. More typical examples of this kind of evaluation are
the phoenix-like Salvage from the University of Minnesota, the Advisor
from the University of Illinois, and an intriguing two-part
description/evaluation handbook from the University of Utah that seems to
avoid many of the problems inherent in these kinds of undertakings.

I think data for this purpose need some special scrutiny. There are
the usual problems concerning the reliability and validity of published
information, but the special problem here seems to be the General Bullmoose
Fallacy that what's good for the average student is good for all students.
I would be much more comfortable if published data were (almost?) exclusively
objective descriptions of course goals, contents, and other characteristics,
or--better still--if what the course offered were spelled out in terms of
a profile of educational needs. Although this idea is not rare with regard
to institutional profiles, little or no work of this kind seems to be taking
place on the more specific classroom level.

But evaluation of teaching for purposes of course selection is probably
here to stay. And so we have three kinds of evaluation that seem to cover
what most of us mean by the term.

Dimensions of Faculty Activity

Once we know why we want an evaluation, we need to know what we want 
to evaluate. What aspect of the faculty member's behavior do we want 
information about? The major thrust of this conference has to do with
student evaluation of teaching, but there are certainly other faculty 
activities that might profit from evaluation of some formal and systematic 
kind: advising, research and fundraising and publication, governance 
(e.g., committee work), public service, and so forth. 

Each of these rather broad areas can be subdivided. For example, with 
regard to classroom teaching, focus might be on the objectives of instruction, 
the behaviors of the teacher or tutor (communication, organization, etc.),
the various instructional materials (texts, other readings, handouts,
audio-visual materials), the physical environment, and the social environment.
To this listing we can add really anything that "impinges on the senses of
the people involved," subject only to the constraints of manageable length
and "reasonable" content. 

Clearly I'm working toward a stimulus-organism-response
conceptualization of the teaching process, and the list I've just described
details to some extent the stimulus component. There is also the organism
component, by which I mean the cognitive operations that the student applies
to this stimulation. To evaluate a teacher by looking at the cognitive
processes of students - cognition, memory, convergent and divergent thinking,
and evaluating, to use Guilford's list - is theoretically possible, is probably
of critical importance, but is certainly beyond our present capabilities.
Nevertheless, this is a focus about which we need to be occasionally reminded.
J. P. Guilford has furnished some of the classical work on cognitive operations,
and Bloom has provided his taxonomy; but some of the most exciting and most
recent work is being carried on by Russell Burris and associates at the
University of Minnesota's Center for Programmed Instruction. In the
context of computer-assisted instruction of material from beginning German
to hematology to literary criticism and insurance law, Burris is working
toward the identification, definition, and measurement of the dimensions
of breakthroughs here. For the time being, however, I'm afraid that the
inner workings of the student are beyond our reach.

But there is still the other side of the stimulus-organism-response
structure, the response or output or product or performance side, which
is essential to an evaluation of teaching. What did the student get out of
the course? What student products or performance can we look at as indices
of the effectiveness of the teacher? In the usual classroom situation, we
can look at term papers, quizzes, and examinations. We can listen to oral
reports and give oral exams. We can observe demonstrations. And we can
evaluate work samples, whether the work is a statue in a studio arts class
or criticism of a research design in a measurement class. The point is that
we need to analyze products or performances from the student if we want to
claim even a relatively comprehensive system for evaluation of the teaching
component of faculty behavior. The fact that propels me so forcefully to
this emphasis is not the aliberal vocational training argument but the
human need to be goal-oriented. I worry that most of our evaluation
activities pertain to the input side of the S-O-R structure - our own
teaching behaviors, the materials we use, the social and physical
environments in which we teach. I contend that more emphasis on the student
response side would help disengage us from too much preoccupation with
ourselves, our "styles", and our materials and would lead us to focus on
those goals toward which our efforts are intended. Furthermore, this
goal-orientedness should make any stylistic changes we make more likely to
be valid in the sense of contributing to student learning.

In this vein, I would like to suggest a somewhat more orderly than
customary approach to examining students, both for the sake of the testing
itself and for the sake of that part of faculty evaluation that depends on
student performance. I'd like to enter a plea for planned examinations,
tests explicitly constructed according to a schema that reflects the purposes
of the instruction and that recognizes not only the differential importance
of the various subtopics of the material but that tests students on different
"epistemological" levels - recall of fact, comprehension of ideas, application,
analysis, synthesis. Bloom's Handbook and Thorndike's instant-classic on
Educational Measurement would be superb reference works in this regard.

But back to the evaluation of faculty. Obviously we can't judge a
teacher on the basis of unqualified student performance. We need to attend
to complex qualifiers like student ability and motivation and other factors
that I'll mention under the heading of Quality of Data.

Quality of Data

The quality of all evaluative information is critically important.
Information - whether from a questionnaire, a written report, an interview,
a work sample, or any other source - is of high quality if it is
simultaneously reliable, valid, and useful. By reliable I mean error free. By
valid I mean that the meaning of the information is known and, at the same
time, is what we intend to use for the kind of evaluation we are undertaking.
And I mean useful in two broad senses, both in the sense that the information
serves its purpose - e.g., helps improve faculty performance - and in the
sense that it is cost/effective, in the definition of cost which includes
not only dollars and cents but less tangible costs like faculty and student
morale and institutional image.

We can evaluate the reliability of our evaluative information in at
least three ways: the standard test-retest and internal consistency
paradigms, and a third paradigm upon which I tend to put considerable
emphasis, transferability.

The test-retest paradigm can discover unreliability in the sense
that data gathered on two (or more) different occasions differ, given an
unchanged subject of the data. For example, if a student ratings
questionnaire is given at two different times (same students, same unchanged
instructor) and the ratings are different, then to the extent of that
difference the information is unreliable. Unfortunately it's extremely
difficult to know which set of data, the first or the second, is the better
reflection of the true situation. Without an experimental study, all we
can really tell is that there is a difference where there should not be.
(I'd like to interject here that simply giving a ratings questionnaire to
a class during the fifth and eighth weeks of a term is not sufficient; we
need to make sure that all relevant variables are under control, e.g. that
the instructor who is being rated has not changed during the intervening
period. The only design I've been able to think of is to play the same
television tape on two occasions and have the same students rate the
instructor each time. If ratings of this instructor are different on the
two occasions, there is reason to doubt the reliability of those ratings.)
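
The taped-lecture design described above can be illustrated with a small
Python sketch. It is only an illustration under stated assumptions: the
ratings are invented, and the helper names pearson_r and mean_abs_shift are
hypothetical, not part of any instrument discussed in this talk. A low
correlation or a large average shift between the two showings would be the
kind of difference "where there should not be."

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_abs_shift(first, second):
    """Average absolute change in each student's rating between occasions."""
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)


# Hypothetical data: the same six students rate the same taped lecture twice.
first_showing = [4, 5, 3, 4, 2, 5]
second_showing = [4, 4, 3, 5, 2, 5]

print("test-retest correlation:", round(pearson_r(first_showing, second_showing), 2))
print("mean absolute shift:", round(mean_abs_shift(first_showing, second_showing), 2))
```

Neither number says which showing is the better reflection of the true
situation; both only describe how far the two sets of ratings diverge.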

The second standard way to study the reliability of information is
to examine the data to see if each respondent was consistent when he should
have been consistent. For example, if a student ratings questionnaire
contains a number of very similar questions about a specific instructor
trait, like organization, and a student's response to those questions is
highly variable, sometimes high, sometimes low, we might distrust his
answers. Of course we have to be sure that there is no legitimate reason
for this variability - that, for instance, the variability does not indicate
that the instructor's lectures were organized, but his answers to questions
were disorganized. One might check this to some extent by comparing one
student's pattern of responses to these organization items to the average
pattern of his classmates, or the pattern of each individual classmate.
If everyone shows the same pattern of variability, the variability is
more likely to be legitimate. Fortunately, a good statistician's standard
bag of tools can provide this kind of information quite readily.
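
As one example of what that bag of tools might contain, the sketch below
computes Cronbach's alpha, a common internal-consistency index, together
with each student's spread across the similar items. Alpha is a technique
chosen here for illustration; the talk does not name a specific statistic.
The data and the function names cronbach_alpha and item_spread are
hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student answer lists (one score per item)."""
    k = len(responses[0])                     # number of similar items
    item_columns = list(zip(*responses))      # regroup the scores item by item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(student) for student in responses])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


def item_spread(student_answers):
    """Range of one student's answers across the similar items."""
    return max(student_answers) - min(student_answers)


# Hypothetical answers of five students to three "organization" items (1-5).
organization_items = [
    [4, 4, 5],
    [2, 3, 2],
    [5, 5, 5],
    [3, 3, 4],
    [1, 5, 2],  # an unusually variable respondent
]

alpha = cronbach_alpha(organization_items)
spreads = [item_spread(s) for s in organization_items]
print("alpha:", round(alpha, 2), "spreads:", spreads)
```

A student whose spread is much larger than his classmates' is a candidate
for distrust; if every student shows the same large spread, the items
themselves may be tapping legitimately different things.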

The aspect of reliability that intrigues me most is what Cattell calls
transferability: information about the same thing should say the same
thing, no matter from whom it comes. To use another example from teacher
rating, if different sources of data disagree - either across sources, as
when students' teacher ratings and their instructor's self-ratings disagree,
or within sources, as when students disagree among themselves - then I
think we have prima facie evidence of unreliability. Again, it's hard to
know which of the sources of data is the more "correct"; to find this out
would require an experimental design with an adequate external criterion.
It might well be that such differences are legitimate, but until the
legitimacy has been demonstrated the fact of disagreement should raise a
flag cautioning possible unreliability.

What is intriguing about the concept of unreliability is its implication
for what we usually call "correlates of data," e.g., correlates of student
ratings. While this correlational information is important and useful in
itself, I think it becomes still more useful when we look at the associated
rater variables - like year in school, IQ, sex - as indicators of levels of
variables over which ratings, in order to be reliable, must remain the same.
Thus, an instructor's rating is reliable (in this sense) if students of
various years in school give him the same rating, and if students of various
levels of intelligence agree. Again, there can certainly be legitimate
reasons for difference in ratings across levels of these variables, but
my point is that these differences need to be studied. For example,
suppose for whatever reason female students tended to give their instructors
more generous ratings on certain items. Consider then the case of two
hypothetically identical instructors, one whose class is composed of all
men, the other's of all women. The latter instructor could be rated more
favorably simply because of this "sex effect". And this phenomenon is not
restricted to student variables. A similar situation exists for situational
variables like class size, hour of the day, and whether or not the course
was required of the students.

One can, however, control these effects either at the item-selection
stage of questionnaire development by eliminating items which show such
effects, or at the data analysis stage by statistically correcting for the
effect, or at the data reporting stage by norming according to these effects.
(To the response that eliminating items on this basis risks throwing away
important information, I go back to the purpose of the evaluation and suggest
that if the data were being used to develop a theory of instruction, such
inconsistency would be relevant, and would have to be accounted for, but
if the data are being used to make a decision about the instructor, these
differences are probably a form of unreliability that should be eliminated.)
Fortunately, we have found it quite possible to develop a broad-spectrum
instructor rating scale even after sex-linked items have been eliminated
from the initial pool.
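
The second option, statistical correction at the data analysis stage, might
look something like the following sketch. It uses simple mean-centering
within rater groups, which is only one of many possible corrections and is
not claimed to be the procedure used at any institution. The function name
adjust_for_group, the ratings, and the grouping variable are hypothetical.
In practice the group effect would be estimated from a large pool of
classes; here it is computed from the toy data itself for brevity.

```python
from statistics import mean


def adjust_for_group(ratings, groups):
    """Shift each rating so every rater group has the same mean (the grand mean)."""
    grand_mean = mean(ratings)
    group_means = {
        g: mean(r for r, member in zip(ratings, groups) if member == g)
        for g in set(groups)
    }
    return [r - (group_means[g] - grand_mean) for r, g in zip(ratings, groups)]


# Hypothetical ratings on one item, tagged with a hypothetical situational variable.
ratings = [5, 4, 5, 3, 2, 3]
groups = ["required", "required", "required", "elective", "elective", "elective"]

adjusted = adjust_for_group(ratings, groups)
print(adjusted)
```

The adjustment removes the average difference between groups while leaving
the overall mean and the differences within each group untouched.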

To conclude this discussion on reliability, I'd like to propose an 
ethic: that the required level of reliability varies with the purpose of 
the evaluation, some uses of the data demanding a substantially greater 
freedom from error than others. My own leaning is that evaluation for 
personnel decisions demands the greatest reliability, since the effects
of error here are, in my opinion, more severe than for any other use of
evaluation data.

Validity is the second important quality of information: Do the data
mean what we think they mean, and is that meaning appropriate for the use
to which we want to put the data? Validation can be of at least three 
types. Some degree of meaning can be attributed to data - again, either 
data from questionnaires or interviews, or whatever - by a relatively 
simple inspection of those data. For example, if knowledgeable people - 
experts - agree, on the basis of their total professional experience, 
that items on a ratings questionnaire do measure consequential aspects 
of teaching behavior, then ratings from that questionnaire take on some 
meaning. (Of course, there's the question of the reliability and validity 
of these experts' opinions, but that's another matter.) 

An external criterion can add still more meaning. If student ratings 
relate to the frequency with which students elect further courses from an 
instructor, certain further meaning is attached to the ratings. If how 
much students learn (not necessarily the grades they get) relates to the 
ratings they give, a great deal of important information is added. Better 
yet, perhaps, if patterns of relationships are found between various external
criteria and various different ratings items, more meaning still is supplied.
By that last point, I mean that a considerable degree of meaning would be
attached to ratings if it could be demonstrated that, say, student ratings
of the popularity of an instructor would relate more highly to an external
(preferably objective) measure of popularity than to indices, say, of
learning; that ratings of teaching skills would relate more to objective
learning criteria than to indices of popularity, and so forth through a
series of logical and pedagogically acceptable hypotheses. 
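
The pattern of relationships just described can be checked mechanically
once the data exist. The sketch below is a minimal illustration with
invented per-instructor numbers. The variable names, the two assumed
external measures (enrollment_index and learning_gain), and the helper
expected_pattern_holds are all hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def expected_pattern_holds(pop_rating, skill_rating, ext_popularity, ext_learning):
    """True if each rating relates more to its own criterion than to the other one."""
    return (
        pearson_r(pop_rating, ext_popularity) > pearson_r(pop_rating, ext_learning)
        and pearson_r(skill_rating, ext_learning) > pearson_r(skill_rating, ext_popularity)
    )


# Hypothetical data for five instructors.
popularity_ratings = [2, 3, 3, 4, 5]
teaching_skill_ratings = [4, 2, 5, 3, 4]
enrollment_index = [1, 2, 3, 4, 5]   # assumed external measure of popularity
learning_gain = [7, 3, 9, 5, 8]      # assumed external measure of learning

print(expected_pattern_holds(popularity_ratings, teaching_skill_ratings,
                             enrollment_index, learning_gain))
```

A real study would of course need far more than five instructors and
criteria whose own reliability had been established.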

That line of thought leads to the third and final aspect of validation,
one which is especially useful in the common case in which no external
criterion is really adequate. From this process of construct validation,
meaning is attributed to the data on the basis of information from a well
articulated interlocking network of logical and empirical demonstrations
of meaning. In other words, the process entails setting forth the total
accumulation of known fact about the data - everything we know from
research on faculty evaluation - in a logical "If-Then" framework. To the
extent that "sensible" patterns emerge, the data become meaningful, the
hypothetical construct "effective faculty performance" takes shape. To
the extent that new hypotheses suggested by the framework are confirmed,
the data take on still more meaning. And to the extent that facts conflict,
then either our research or our logic is suspect and the meaning of the
data is encumbered. The articulation of such a framework concerning
faculty evaluation, I'm afraid, is still rather far in the future.

Just as an ethical principle rises from the notion of reliability,
so too one comes from the idea of validity. Again, and for the same
reasons, I would propose that the level of validity required of evaluation
data varies with the purpose of the evaluation, and that data for personnel
decisions require the greatest degree of validity. But data need to be
not only reliable and valid; they need to be useful. "Useful" is a very
broad word in this context. It means first that faculty evaluation
information needs to work, needs to contribute (at least potentially - that
is, if people choose to use it) to the improvement of faculty performance.
Student ratings done to help improve teaching, for example, need to be able 
to help improve teaching. 

In a still broader sense, data need to be useful in cost/effectiveness 
terms. Clearly, we need to consider the dollars and cents aspects of any
system of evaluation. The computer-terminal system that I described earlier
for providing information to help students choose courses would probably
not meet common cost/effectiveness criteria. But the intangible cost of
any data and of any system must be studied too. What does it cost in
terms of class time to gather data? Is this time well spent? Is there a
cost to our evaluation in faculty or student morale? Is the image of the 
institution helped or hindered (in the eyes of the public, including 
legislators and trustees, as well as in the eyes of faculty, students, and 
administrators)? All these kinds of questions come under the heading of 
cost of a system, and therefore, utility of a system. 

So, in order to be able to say we have data - or a system - of high 
quality, we need to demonstrate the reliability, validity, and utility of 
the data. 

Related to reliability, validity, and utility are certain considerations
that moderate or qualify the data. Three prominent modifiers are
responsibility, competency, and motivation. For example, a faculty member might
receive an unfavorable evaluation with regard to the text he uses in his
teaching or the apparatus he uses in his research. But if all the texts
in his area are poor, or if the good texts are prohibitively expensive,
or if the proper apparatus is not available to him, and if he is aware of
all this, then he cannot be held so responsible for these deficiencies as
the person who simply isn't able to distinguish good materials from bad.
In the same vein, the junior faculty member who is required to teach material
with which he is not familiar and does a poor job is not so responsible as 
his senior colleague who chooses to teach the same course and teaches it 
equally poorly. 

The competency of the sources of data to evaluate is another moderator,
whether it's the competency of colleagues who have never set foot in a
teacher's classroom to evaluate that teacher's teaching, the competency of
students to evaluate the long-term effects of the instructor, or the
competency of a chairman whose specialty is vastly different from a
researcher's to evaluate that research.

And there is the question of motivation. We find certain phenomena 
in ratings of all kinds - rating too leniently, or rating too harshly, for
example. But these same kinds of phenomena can affect all kinds of
(subjective) evaluation data. There can be vested interests or psychological
reactions of all sorts that usually will manifest themselves as "leniency
effects" or "stringency effects". Some kinds of statistical machinations
can reduce some of these effects, but it's unlikely that statistics will
ever control all of them. Consequently any evaluation needs to consider
what these moderators can do to the reliability, validity, and utility of
the data. 

Sources of Measurable Data 

Where do these data that I've been talking about come from? The
possible sources of information about faculty performance are relatively
obvious: students (present or previous), colleagues, administrators,
members of the community, specialists in relevant fields, and the faculty
member himself. From each of these sources we can get subjective information -
opinions - about at least some aspect of faculty performance. From students
the information we can get might be either subjective - like ratings - or
objective - like the performance scores I've stressed. (It is conceivable
that we might some day be able to get objective information from the faculty
member himself, if there were, for example, a reliable and valid "How well
do I teach" test; but to my knowledge no such test exists today.)

(It is worth pausing here to dispel too common a misconception about
the "objectiveness" of ratings. The fact that questions are couched in
"objective-looking" multiple-choice phrasing and can be processed by a
computer doesn't in any way alter the fact that all ratings are subjective
personal opinions. Now, we can certainly decide - and there is nothing
wrong with this so long as we are aware of what we're doing - that opinions
are the data we want for whatever the purpose of our evaluation; in that
case, all we really need to do is demonstrate the reliability of the ratings.
If, however, we want something other than opinion upon which to base our
evaluation, then we need to relate the ratings to some external and more
objective performance criterion: a learning criterion, perhaps, for
evaluating teaching, a "correct outcome" criterion for research, and so
forth, to the extent that criteria can be discovered.)

When I list these various sources of evaluation information, I do
not mean to imply that these different types of people are all equal in
the quality of information they can give about any aspect of faculty
performance; neither do I mean to suggest a preference for any one over
another. But I would be extremely interested in seeing a well designed
transferability study for the evaluation of teaching in which, say, ratings
from students, colleagues, administrators, and present and former students
were all compared to one another, first in terms of their reliability and
second, more important, in terms of their relationships to an external
performance criterion (student learning). The reason I emphasize this
point is that it is entirely too easy to approach student ratings with the
stringent set of data-quality criteria that I've outlined, and simply to
"badmouth" student ratings. It's entirely too easy to criticize student
ratings in absolute terms without paying any attention to the quality of
student ratings relative to each of the other kinds of evaluative information
available to us.

But my intent here is not really to hold a brief for any one source
of data over another, only to say that no system of faculty evaluation
can claim to be complete unless it has seriously studied the data from
each of these sources. I doubt that any of these sources can be safely
neglected in the research and development of a system of faculty
evaluation, because I imagine that we will find one source most helpful
for the evaluation of some kinds of faculty activity, other sources
better for other kinds.

In this regard, it's interesting to look within each source of
data and ask if certain students, certain colleagues, certain
administrators, and so forth, might furnish more reliable, valid, and useful
data than their peers. It would be a relatively simple matter to
manipulate existing data to discover subsets, say, of students whose opinions
more than their classmates' relate to a learning criterion. Identifying
these students in terms of various personality and demographic variables
could be informative indeed. It might even provide a way of sampling
just certain opinions from future evaluations, those whose judgments
are probably more sound than their confreres'.

Media for Gathering Data

The media that are available for gathering and reporting data
are another consideration. Pencil and paper still seem to be the
quickest way to provide information; the questionnaire is inexpensive
to provide and to analyze. But questionnaires are not necessarily
the most efficient (cost/effective) means of garnering information.
This is pure speculation, but it's possible that evaluations for
improving teaching might be better served by some other medium - e.g.,
audio- or video-tapes.

I make this allusion to tapes because I suspect that they can
provide some more meaningful kinds of information than the usual
questionnaire. What makes me uncomfortable about questionnaires is the
usual way in which data are reported. Frequency distributions, means,
standard deviations, and deciles can certainly summarize a great amount
of information, but these statistics have their drawbacks. I'm not
really referring to the fact that many faculty don't understand
statistics or are repelled by them; faculty are educable. I'm more concerned
about the faculty member who receives low ratings: what can he do to
change? To tell me that I am disorganized is not necessarily to tell
me how to become better organized, and that fact makes me wonder how
responsible - as well as how sensitive - we're being when we simply run
ratings through computers and provide routine statistical analysis. I
would be most pleased to have access to a Faculty Counseling Bureau
where experts in the various arenas of faculty behavior could provide
reliable, valid, and useful guidance to faculty who are trying to
improve their performance. Some schools apparently have facilities of
this sort. Ours has no such formal structure (although there are some
informal avenues open - e.g., colleagues who are willing to share their
experience and offer suggestions), so we have been experimenting with
using the computer to generate prose narratives that expand on the
basic data of teacher ratings. The computer examines an instructor's
ratings profile and prints out personalized sentences that offer
suggestions for changing low ratings and that reinforce high ones. But
the computer approach and the Faculty Counseling Bureau approach share
one major weakness: How can we know that the suggestions we offer are
reliable, valid, and useful? At this point in time, until more research
is in, all we can do is try to be reasonable, and acknowledge publicly
that our counseling is highly subjective.
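
The general idea of turning a ratings profile into prose can be sketched in
a few lines of Python. This is not the program referred to above, whose
details are not given here. The function name narrative, the cut-off values,
and the wording of the sentences are all hypothetical and chosen only for
illustration.

```python
def narrative(profile, low=2.5, high=4.0):
    """Turn a {trait: mean rating} profile into short personalized sentences.

    The thresholds `low` and `high` are arbitrary illustrative cut-offs.
    """
    sentences = []
    for trait, score in profile.items():
        if score <= low:
            sentences.append(
                f"Students rated your {trait} low ({score:.1f}); "
                "a colleague's suggestions might help here."
            )
        elif score >= high:
            sentences.append(
                f"Students rated your {trait} highly ({score:.1f}); "
                "this appears to be a strength worth keeping."
            )
    return sentences


# Hypothetical profile of mean ratings on a 1-5 scale.
profile = {"organization": 2.0, "enthusiasm": 4.5, "pacing": 3.2}
for line in narrative(profile):
    print(line)
```

Middling ratings produce no sentence in this sketch, and the weakness noted
above applies in full: nothing in the code shows that the canned advice is
itself reliable, valid, or useful.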

Beyond tapes, there is personal verbal communication - talking.
As awkward and frustrating as it might be, as much diplomacy as it might
sometimes require - there is still no substitute for face-to-face
communication. While data-quality problems - reliability, validity, utility - are
extremely hard to deal with in verbal communication, the clarification
and amplification of meaning and the exchange of views that can be
accomplished through speech cannot be surpassed by any other medium.
Some of our faculty have been urging a combined approach to
self-improvement evaluations in which personal exchanges between students
and teachers supplement the information gathered by ratings forms. I
know of no data to support the utility of this approach, but the idea is
most reasonable and the reports from people who have tried it have been
good.

Temporal Considerations

A time dimension needs to be considered with regard to when
information is collected and used and how it is reported. I do not
want to bring up the issue of whether student ratings should be gathered
before or after exams; this is largely an empirical question for which
I have no data. Nor do I want to dwell on the dangers of all instructors
asking for ratings during a single week so that students might be asked
to fill out four or five questionnaires that many of them consider
noxious or inane. This is essentially a question of student motivation
which I think can be best met by public demonstrations that student
responses are valuable, that someone pays attention to the ratings and
that something happens because of them. It can also be helped by
convincing faculty to use ratings sometime before the last weeks of
the term so that, first, ratings aren't deemphasized by impending exams
and, second, there is at least a chance that the information a
group of students provides might be of some direct benefit to those
particular students. In short, we need to minimize the aversiveness
and maximize the reinforcement to the respondents.

But the developmental issue I really want to emphasize is that
almost any kind of faculty evaluation system attempts to measure typical
performance as distinguished from maximum performance. We are therefore
sampling behaviors, and our data might - should, in fact - reflect the
whole range of behavior variation. An instructor might have a great
day or a lousy one, and ratings will reflect that. He might have a
great quarter or a lousy one, and ratings will reflect that. I think
we need to file evaluation data term by term so that developmental
patterns of evaluations can be studied. Some faculty and some
departments routinely store such information for this very purpose. It's
also feasible, in situations where ratings are centrally processed,
to include in the instructor's print-out summaries of past ratings
for comparison with current ones. The point is that faculty performance
ought to be examined developmentally, not just at one point in time.

Let me make one last remark in this respect - one concerning the
transferability of data. If we choose to look at performance
evaluation longitudinally, we need to be sure that the data are transferable.
That is, since a different class of students is presumably involved
each term, we need to be sure that any differences (or similarities)
across terms are due to differences (or similarities) in the instructor's
behavior, and not due to the changing group of students.

Consequences of Evaluation

I have mentioned various groups of people - students, faculty,
administrators, the public, and the faculty member himself - as sources
of data. We need also to look at them in terms of the consequences
they might enjoy or suffer as a result of the evaluation. Favorable
or unfavorable evaluations might have good and/or bad outcomes, and
these outcomes might affect people in any or all of a number of ways.
To give a few examples:

An uncomplimentary evaluation could certainly hurt the career
development of a young untenured faculty member. But it could also
enhance his development if it contributed to some appropriate behavior
change, or if it guided - or forced - him into circumstances in which he
was more likely to be both satisfied and satisfactory.

A complimentary evaluation, on the other hand, could clearly
help confirm or improve a faculty member's status (and remuneration).
But that same good evaluation could also excite the envy of his
colleagues, which could ultimately be more harmful to him than a bad
evaluation might have been. A good evaluation, paradoxically, could
lead a chairman to "urge" a person, say, who loved research but who
happened to be a good teacher to increase his teaching load at the
expense of time for research. Or vice versa.

Consider the chairman of a department. Any kind of evaluation
may well raise problems for him - especially evaluations for personnel
decisions - because any differential treatment of faculty may damage
morale. Unfavorable evaluations of any of his faculty must be especially
troublesome for the chairman - more so than for the higher-level
administrator - because the chairman is most likely the person with the
immediate responsibility for painful decisions (e.g., firing a colleague
or refusing him a pay increase) or even for pointing out deficiencies.
(Some chairmen, though, I'm told, have learned to cope wonderfully
well.)

The student, too, runs some risk - though apparently less than
either faculty or administrators. (Perhaps this might have something to
do with students being generally more in favor of evaluation than either
faculty or administrators!) It's hard to see how a student could be
endangered by providing a favorable evaluation (assuming the evaluation
is of high quality). But it unfortunately does not strain the
imagination to think of unpleasant consequences that might befall the students
if evaluations they gave were highly uncomplimentary. Hopefully this
distasteful situation is less common than the emphasis on anonymity
in ratings would lead us to believe. On the other hand, one would
expect students to profit over the long run from any kind of high
quality evaluation of teaching. For that matter, it would not be hard
to build the argument that any faculty member - and the entire academic
community - would profit over the long run from high quality faculty
evaluations.

But enough about consequences. The human ego is of such complexity
and creativity that no adequate listing of the possible consequences of
evaluation seems possible. It's enough at this point simply to express
the concern and to try to anticipate the most likely consequences.

Institutional Goals

The final set of considerations I want to discuss pertains to
the goals of the institution, either as they pertain to the institution
as a whole or to any part of the institution - division, department,
program, or course. I would think that the goals of almost any
institution would include at least some degree of teaching, advising, research,
governance, and public service. If this is the case, or whatever the
institutional goals might be, however general or specific, I think
evaluations need to be considered in light of those goals. All I mean
here is that in a school with primarily an instructional emphasis, it
doesn't "count" so much that a faculty member may be an excellent
researcher; he needs to be a good teacher. Conversely in a research
institute, the faculty member's teaching is of less concern than his
research. And in land-grant colleges the public service role is perhaps
more prominent than in private colleges. I'm suggesting here the need
for a correspondence between the institution's goals and its members'
behavior. But the other side of the coin is appropriate too: when a
member's goals, manifested by his performance, are different from the
institution's, both parties need to assess the legitimacy of their
priorities. Thus a person in a small college who is a skilled researcher
but a poor teacher might decide to move to a research institute (or a
research job in a teaching school); or the school might decide that
research is a more tenable goal than it had previously believed. Thus
some major universities have reminded themselves of the place of
teaching in the list of institutional priorities.

Conceptual Schema

We've spent a substantial amount of time talking about eight
different kinds of considerations that deserve attention when we plan
or study systems of faculty evaluation. I don't suggest that these
eight encompass all the considerations there are, nor do I consider
all eight equally important; but I do believe each merits attention.

In the time remaining, I'd like to try to build a conceptual
schema that takes account of all these considerations. What I've
really been trying to do this morning is not just discuss some random
concerns about faculty evaluation, but to lay out in a more or less
organized fashion these different considerations. Now I want to try
to draw them together.

My basic tools are two- or three-dimensional figures, after the fashion
of Cartesian coordinates.

Insert Figures 1 & 2 about here

Let's fill in the coordinates. On the X axis, we can list the
different aspects of faculty performance: teaching, advising, research
and publication, governance, and public service, each heading with all
the specifications I've described.

Insert Figure 3 about here

On the Y axis we could add the sources of measurable data: students, 
colleagues, administrators, the public, the faculty member himself, and 
so forth. 

Insert Figure 4 about here 

Thus the upper left intersection refers to evaluation in which the
students are the source of evaluative information about the different
components of the faculty member's teaching performance, and the
descending cells concern student information concerning advising,
research, and service.

We can add a third dimension: Quality of Data, or reliability,
validity, utility, and moderators.

Insert Figure 5 about here

The top-left-and-front-most cell now becomes a consideration
about the reliability of student information concerning teaching
performance, and the descending cubes refer to the reliability of the
information students can provide about the other arenas of faculty
activity. So we're dealing with a three-dimensional figure that can
describe X times Y times Z specific considerations about faculty
evaluation, where each cell represents a "consideration." (There
are other things we can do with these cells, as we'll see shortly.)
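
The three-dimensional figure can also be read as a Cartesian product of
three lists, which the following sketch enumerates. The axis entries are
taken from the lists given in this talk (with "and so forth" dropped); the
names ASPECTS, SOURCES, QUALITIES, and build_cells are hypothetical labels
introduced only for this illustration.

```python
from itertools import product

# X axis: aspects of faculty performance.
ASPECTS = ["teaching", "advising", "research and publication",
           "governance", "public service"]
# Y axis: sources of measurable data.
SOURCES = ["students", "colleagues", "administrators",
           "the public", "the faculty member himself"]
# Z axis: quality of data.
QUALITIES = ["reliability", "validity", "utility", "moderators"]


def build_cells(aspects, sources, qualities):
    """List every X-by-Y-by-Z cell of the schema as a short description."""
    return [
        f"{quality} of information from {source} about {aspect}"
        for aspect, source, quality in product(aspects, sources, qualities)
    ]


cells = build_cells(ASPECTS, SOURCES, QUALITIES)
print(len(cells), "considerations; the first is:", cells[0])
```

Lengthening or shortening any one list changes the number of cells without
touching the other two axes.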

Now I'm going to break the laws of physics and go into the fourth
dimension, the purposes of the evaluation. 

Insert Figure 6 about here 

The top red cell talks about the validity of student information
about teaching for the purpose of improving that teaching, the middle
red cell about the validity of that same information for personnel
decisions, and the bottom red cell about the validity of information
from students for helping other students select courses.

The green cells down the X column in the top figure then talk 
about validity of student information for improving various faculty 
activities other than teaching: advising, research, and service. 

The green columns along Y in the middle schema talk about the
comparative validity for personnel decisions of information about
teaching gathered from the faculty member himself, his colleagues, 
administrators, and so forth. 

And the green column along Z in the bottom schema asks about 
the reliability, utility, and moderators related to student information 
about teaching intended to help other students select courses. 

It would be possible to set all eight of our dimensions into 
a schema like this, but the model would be so unwieldy as to become 
useless. So let me just mention the remaining four considerations for 
review: the media for gathering and reporting information (questionnaires, 
audio and visual tapes, computers, personal confrontations), the 
temporal component of evaluations (for longitudinal patterns of 
interpretation), the consequences of evaluation to each of the people 
involved, and the goals of the institution and the meaning these 
priorities add to or subtract from the evaluative data. Each one of 
the dimensions interacts with the four dimensions already presented in 
the schema. Since there doesn't seem to be a reasonable way to include 
these in the schema (although we could, at the cost of some of the 
interactions, substitute them one by one for Quality of Data on the Z 
axis), I'd at least want to see them included as footnotes to each cell 
in this model. 

I need to make a few remarks about the flexibility of this model 
before going into a brief description of its applications. The model 
I've sketched is based on my own reflections about our institution, but 
any part of the model can be changed to fit another school. For example, 
the list of faculty behaviors--axis X--can be lengthened or shortened 
or in any other way modified. The Institutional Goals could be 
changed; so could any of the other components. (I would, however, 
hesitate to change the Purposes of Evaluation or the Qualities of Data, 
except perhaps to make them more specific.) In short, the thrust of 
this whole presentation is that we need to spell out these "considerations," 
the important components of faculty evaluation, and then cast them into 
a schema such as this in order to see in detail the problems that we're 
facing. I'd hate to see this thrust hindered simply because a few of 
the parts of the model didn't apply to another school. 

For the final few minutes I'd like to talk about uses for this 
schema. What's it good for? Because it's only a glorified outline, 
it can do whatever an outline can do. It can provide the structure for 
a talk (as it has today, to a large extent). Or it can guide a 
literature review or a research program, pointing out questions in each of 
the cells that need to be answered by work already done or yet to be 
done. 

The model also seems to be a powerful tool for building or for 
criticizing different instruments and, better still, for developing or 
evaluating systems of faculty evaluation. For example, suppose we want 
to develop a student ratings questionnaire. Item writing can be guided 
by the first parts of the Faculty Behaviors dimension. Each of the 
cells on the X axis can hold any number of items that attempt to measure 
the particular aspect of faculty behavior. Because we want only student 
information at this point, we stay with the first column on Y. The 
item retention and validation phases of questionnaire development can 
progress (in whatever order) across each element on the Z axis--validity, 
reliability, utility and moderators. Further considerations about the 
items arise as we move across all the other dimensions. 

The same approach, using different rows and columns, holds for 
the development or criticism of instruments for colleagues' evaluation 
of a faculty member's research, for self-evaluation of one's own committee 
work, or for any of the other X times Y possible instruments. 

An analogous procedure can help us build a system of faculty 
evaluation. First we could determine with the help of the X axis 
which behaviors we want to evaluate. Then, with Y, we could select 
the sources of evaluative information that seem most appropriate for 
each behavior. We might consider how to gather the information by 
examining the media dimension. And we could move across Z (Data 
Quality) to evaluate this whole battery of data-gathering devices. 
Finally, each of the dimensions could help us by pointing out further 
considerations that our system needs to attend to. 

I've found these to be the prime applications of this model: 
outlining my own thoughts about faculty evaluation, guiding me through 
the literature, directing our research program, and aiding in the 
development and/or criticism of instruments for any aspect of faculty 
evaluation. I want to stress the flexibility of the schema, its ability 
to tolerate more or fewer dimensions and the modification of any of 
those dimensions. And I want particularly to note the fact that the 
further each of the basic dimensions is specified (the more specific 
the listing of faculty behaviors, for example), the more complete the 
model is and the more it can help make order out of the chaotic field 
of faculty evaluation. 

THEIR AFFECTIVE BEHAVIOR THROUGH FEEDBACK 
Bruce W. Tuckman 
Rutgers University 

I. Introduction 

What I am going to do at the outset is present a general model of 
teacher behavior, and talk about some research findings that I obtained 
in my work with my students having to do with changing the behavior of 
teachers by giving them feedback. I am going to describe two studies 
that led me into this area of interest and led me to draw some 
conclusions upon which my operational approach is based. Then, based on 
these findings and some related theoretical notions, I am going to 
present some general rules that I see as descriptive of the change 
process in general and appropriate for changing the behavior of teachers 
in particular. And finally, I will describe a specific technique that 
I have developed and begun to use for providing feedback to teachers. 

Let me emphasize at the beginning that I will be talking about 
feedback, not evaluation. Evaluation is a term that has many 
connotations, not all of which are part of its technical meaning. But 
I will be talking about feedback, about individuals gaining an awareness 
of their own behavior in order that they might change it in desired 
directions (with the emphasis on the word "awareness"). 

II. A Model of Teacher Behavior 

Figure 1 illustrates a general model of teacher behavior which 
helps provide the context for an examination of the change process. 
The focal element in this model is teacher behavior. This relatively 
simple model portrays (1) teacher training experiences, (2) the 
teaching environment in which the teacher is located, and (3) 
characteristics of the student that confront the teacher, as factors 
that cause the teacher to be what he or she is, a mixture of style, 
skills, attitudes, and so on. Now this person, the teacher - this 
collection of style, skills, attitudes, etc. - then goes into a 
classroom and teaches, that is, produces teaching behavior. The result 
of that teaching behavior is student outcomes. 

What I am particularly interested in is the connection between 
the teacher as a person and teaching behavior. There is (as you can 
see in Figure 1) a loop connecting the teacher as a person and her 
teaching behavior. This feedback loop must be based on awareness. 
In other words, if the teacher monitors her own teaching behavior 
and, by monitoring it, becomes aware of what that behavior is and how 
it affects students, then as a result she can make modifications in 
her style, skills, attitudes, and so on. If this monitoring or 
awareness does not occur, then the likelihood that changes will occur is 
minimal. Since the first three factors (training, environment, and 
students) are things over which the teacher does not typically have 
much control - the students come and go, the training is already done 
for the most part (except for limited in-service experiences), and 
the teaching environment is quite complex - the teacher as a person is 
going to be invariant unless a feedback loop between teaching behavior 
and the teacher exists. Creating that feedback loop, that awareness, 
is what this presentation is about. 

III. Some Relevant Research Findings 

How can teachers be made aware of how they are behaving in such 
a way that they can change their behavior in a desired direction? 
Let me describe the first of two experiments that I completed with 
my students and colleagues directed to this question. The first of 
these (Tuckman, Hyman, and McCall, 1969) employed Flanders Interaction 
Analysis categories (Flanders, 1965) as the feedback instrument. 
(These categories are shown in Figure 2.) 

The first thing to decide, if you are going to give teachers 
feedback, is in what form it should be given, that is, what instrumentality 
should be employed. Obviously you can observe a teacher teaching and 
then sit down with the teacher and, in the course of a discussion, pass 
on feedback to the teacher. But we have an intuitive feeling (that has 
since been reinforced time and again) that in order to affect someone's 
behavior, feedback has to be definitive; it has to be concrete. 
Assigning numbers to categories seems to be the most definitive, concrete 
information that can be transmitted. Thus, we decided against some kind 
of anecdotal reporting, or rap session, or something like that, and in 
favor of some systematic set of categories about teacher behavior that we 
could report. We chose the Flanders System. 

The Flanders System (shown in Figure 2) has a number of categories 
and breakdowns (or scores), the major ones being teacher talk - the 
number of statements the teacher makes, student talk - the number of 
statements the students make, indirect influence - how the teacher 
reacts to what students say or gets them to participate, and direct 
influence - the lecturing and authoritative behavior of the teacher. 

The Flanders System is a model for information transmission but 
a model for motivation is still required. There has to be motivation 
for change to occur. Simply providing people with information may not 
be sufficient to motivate change; we might say that information is a 
necessary but not sufficient part of the change process; motivation 
is also needed. Motivation was introduced in this study using a 
prominent social psychological model called cognitive dissonance theory 
(Festinger, 1957). (The tenets of this model appear in Figure 3.) 
Basically, cognitive dissonance theory postulates that people are 
motivated to have a high degree of internal consistency, to be 
consistent in their attitudes, behaviors, perceptions, and so on. A 
person who thinks of himself as a good citizen believes that he behaves 
in a way that is consistent with that perception. A person who thinks 
of himself as a good teacher and believes that a good teacher behaves 
in a certain way, is motivated to behave in a way which is consistent 
with those perceptions and expectations. Moreover, where inconsistency 
exists, the person is motivated to reduce it, and may do so by changing 
the attitudes involved, changing the perceptions involved, or changing 
the behaviors involved; in short, by changing some element of his 
psychological system. If a teacher thinks, for instance, that a good 
teacher should not talk much and discovers through feedback that he talks 
a lot, he will experience cognitive dissonance resulting from the 
inconsistency between the self-perception that he, as a good teacher, does not 
talk that much, and the feedback that he talks a lot. 

What can he do in this circumstance? He can either decide that 
a good teacher really should talk a lot (i.e., alter his expectation 
or self-perception) or he can begin to talk less (i.e., alter his 
behavior). In either event, he will produce more consistency, which 
is, according to dissonance theory, what he is motivated to do. 
Similarly, if a teacher believes that the manner to influence students 
is not to be directive but to be non-directive (or indirective) and 
this teacher gets feedback that he is, in fact, directive, again there 
is inconsistency. The teacher will be motivated to reduce this 
inconsistency. How can he do this? He can change his idea of what a good 
teacher should do (or what he, as a "good" teacher, does) or he can 
change his teaching behavior. Thus, cognitive dissonance theory says 
that people strive for consistency, i.e., that they are motivated to 
be consistent. If you can (1) show people that they are inconsistent, 
and (2) constrain them to deal with that inconsistency so they cannot 
weasel out (because there is that tendency), then they will change 
some element of that inconsistency. This is dissonance theory. 

We now have in the study what we believe to be the two sufficient 
conditions for change: motivation and information. The Tuckman, 
McCall, and Hyman (1969) study dealt with the variation of both 
motivation and information. A group of 24 teachers were given a form 
that corresponded to the Flanders categories and asked to estimate 
the percentage of time that they spent in each of the categories. This 
was a measure of self-perception. Then an observer used the Flanders 
categories to code the behavior of each teacher. We now had, on the 
one hand, what the teachers said they were doing and, on the other hand, 
what they were observed to be doing, and we could determine the degree 
of inconsistency. The 24 teachers were separated into two groups: 
those whose inconsistency was greatest, and those whose inconsistency 
was least. In one group teachers were doing what they thought or 
said they were doing, and in the other group they were not doing what 
they thought or said they were doing. And, of course, based on 
cognitive dissonance theory, the group with the greatest inconsistency 
was expected to have the greatest potential for change because they had 
the greatest motivation. 
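
The grouping step just described can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the study: the discrepancy measure (a sum of absolute gaps in percentage points), the median split, the two stand-in categories, and all of the figures are hypothetical.

```python
from statistics import median

# Stand-ins for the Flanders categories; the real form has more of them.
CATEGORIES = ["teacher_talk", "student_talk"]


def discrepancy(estimated, observed):
    """Sum of absolute gaps (percentage points) between what a teacher
    estimated and what the observer coded. The measure is an assumption."""
    return sum(abs(estimated[c] - observed[c]) for c in CATEGORIES)


def split_by_discrepancy(scores):
    """Median split (an assumption): return (high, low) lists of teacher ids."""
    cutoff = median(scores.values())
    high = sorted(t for t, s in scores.items() if s > cutoff)
    low = sorted(t for t, s in scores.items() if s <= cutoff)
    return high, low


# Made-up figures for four hypothetical teachers, not data from the study.
estimated = {
    "A": {"teacher_talk": 70, "student_talk": 30},
    "B": {"teacher_talk": 80, "student_talk": 20},
    "C": {"teacher_talk": 60, "student_talk": 40},
    "D": {"teacher_talk": 75, "student_talk": 25},
}
observed = {
    "A": {"teacher_talk": 90, "student_talk": 5},
    "B": {"teacher_talk": 82, "student_talk": 15},
    "C": {"teacher_talk": 85, "student_talk": 10},
    "D": {"teacher_talk": 78, "student_talk": 20},
}

scores = {t: discrepancy(estimated[t], observed[t]) for t in estimated}
high, low = split_by_discrepancy(scores)
print("high discrepancy:", high)
print("low discrepancy:", low)
```

Whatever measure is chosen, the point is the same: the high-discrepancy group is the one dissonance theory expects to be most motivated to change.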

Each of these two dissonance groups was divided into four smaller 
groups, each of which had one of the following experiences. Teachers 
in one of these groups were given the reports of the observer who went 
over each category and gave them verbal feedback. (E.g., "You say you 
are talking 70% of the time but the data show you are talking 90% of the 
time. You say students are talking 30% of the time but the data show 
they are talking 5% of the time.") Teachers in the second group were 
taught to use the Flanders interaction analysis coding system and then 
had to code one another. This was done with the expectation that the 
best way to give people feedback would be to give them a mechanism 
for self-feedback. Presumably, a teacher who knew the coding system 
would be giving himself feedback all the time - resulting in a powerful 
effect. Teachers in the third condition listened to tape recordings 
of their own classes, and thus had to develop their own feedback and 
self-assessments. Since listening to tape recordings is kind of the 
antithesis of concrete, definitive feedback, it was expected to produce 
little effect. (Somebody could listen to a tape recording of himself 
and get no feedback whatever.) Teachers in the fourth condition were 
given no access to any information of any sort. The fourth group is 
what the research designer calls a control group. 

When we looked at the findings in this study (shown in Figure 4), 
we found there were tendencies for the discrepancy between teachers' 
self-perception and behavior to become smaller; as predicted, 
the group with the greatest discrepancy had the greatest tendency to 
reduce it. In other words, teachers who had the most motivation to 
change were the ones that changed the most. That seems to make 
reasonably good sense. We also found that in the verbal feedback 
condition (i.e., when we sat down and gave teachers exact, 
quantitative feedback) those teachers changed; they reduced their discrepancy 
the most. The teachers that were taught the coding system did not 
change. In retrospect, were I to do the study again, I would have 
those teachers who learned the system code themselves from their own 
tape recordings rather than having them code their colleagues as they 
did. Apparently, even though teachers know a behavior coding system 
they do not necessarily use it on themselves unless they are put in a 
concrete situation where they have to, which is what we did not do. 

It turned out in this study that the motivation factor, i.e., 
the size of the discrepancy, primarily affected teachers' self-perceptions. 
Teachers who had the greatest discrepancy were the ones who changed their 
perceptions most. In a sense you might say that that is a kind of cop-out. 
If a teacher thinks that a good teacher should not talk very much and 
then finds out that he talks a lot, he then decides to say that he really 
does talk a lot but that is O.K. This was the nature of the finding. It 
did not occur for teacher talk, only for what is called the indirect ratio. 
The indirect ratio is a very complicated notion containing many elements all 
dealing with how the teacher reacts to students. Because of its complexity, 
it is not surprising that teachers did not change their behavior on the indirect 
ratio; rather, they dealt with discrepancies on it by changing their 
self-perceptions. 

Our primary interest here is not in changing self-perception, 
however, but in changing behavior. Teachers did significantly change 
their behavior as a result of the verbal feedback. They changed their 
behavior in terms of the one element in the coding system over which 
they had the greatest degree of control: the amount of their talking 
in the classroom. They actually talked less. There was a very strong 
tendency for teachers to underevaluate their amount of talking; in 
other words, teachers typically believed that they talked less than 
they actually did. The resultant discrepancy motivated teachers in 
the verbal feedback condition (who were aware of the discrepancy) to 
talk less. The size of the discrepancy did not seem to matter; it 
happened pretty much across both discrepancy groups. Thus, teachers 
in this experiment talked less as a result of feedback. 

At this point we began to think we had something, so we started 
out again in a somewhat different direction. In the second of the two 
experiments (Tuckman and Oliver, 1968), we used a different strategy. 
We had, in the first experiment, looked at motivation and the 
particular kind of information, i.e., the Flanders system as judged by trained 
observers, as change factors. In the second experiment, we looked at 
student judgments on a student opinion questionnaire as the source of 
feedback, and instead of looking at motivation per se, we looked at the 
source of feedback as a change factor. Who did the feedback come from? 
Two sources were investigated, each alone and in combination. One of the 
sources was the teacher's supervisor (in most cases assistant principals) 
and the other source was the teacher's students. 

Four groups were used: one group got feedback from students only; 
one group got feedback from supervisors only; one group got separate 
sets of feedback from both students and supervisors; and one group got 
no feedback (the control group). Feedback was given on an instrument 
called the Student Opinion Questionnaire (SOQ) developed by Bryan 
(1963) and shown in Figure 5. It is an instrument usually filled out 
by students but actually anybody can use it, supervisors or students. 
Feedback was the same in all cases, so everybody was getting the same. 
The initial judgments of students and supervisors on a particular 
teacher did not differ so that regardless of what source a teacher was 
getting feedback from, the feedback was essentially the same. 

The feedback had to do with the following ten areas (see Figure 5): 
the knowledge the teacher has of the subject taught, his ability to 
explain clearly, his fairness, his maintenance of discipline, his 
understanding, how much you are learning, "interestingness" of the class, 
efficiency and businesslike manner of the teacher, skill in making 
students think for themselves, and the teacher's general, all-around 
teaching ability. These are global kinds of judgments but they still 
give you numbers. And the numbers were put on a graph, and the teachers 
given a profile of how they were seen in each instance. Incidentally, 
the SOQ has some open-ended questions on the reverse side which were 
not used in the analysis but made available to the teachers. These 
items provide a place to write in what you especially like about the 
teacher, how you think the teacher should improve, what you especially 
like about the course, and how you think the course could be improved. 
No attempt was made to quantify this information. Teachers were 
further separated into three groups based on how long they had been 
teaching. 

Now for the findings (which are shown in Figure 6). First of 
all, the feedback changes were in the negative direction in every 
case. What that means is not that the teachers were necessarily 
worsening over time; since it occurred in every condition, you have 
to conclude that something else was going on. What seemed to be 
going on is that if you asked students to judge their teachers in 
February and you asked those same students to judge their teachers 
in June, they would be less positive about those teachers in June. 
Perhaps spring captures their fancy, or possibly since the teacher 
is about to give them the grade, they see this as a chance for 
retribution. At any rate, it is an end-of-year effect. Essentially, 
the students become increasingly negative toward teachers in all 
conditions over time, but we still could evaluate the outcomes of 
the experiment in terms of which condition had the least tendency 
to become negative and which had the most tendency to become 
negative. 
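
The comparison described here, judging each condition by how much less (or more) negative it became than the others, can be sketched as follows. The ratings are invented for illustration and are not the study's data; the rating scale, the three-teacher groups, and the use of the no-feedback group as the baseline are all assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def mean_change(february, june):
    """Change in mean rating from the February to the June administration."""
    return mean(june) - mean(february)


# Invented ratings on a hypothetical 1-5 scale: (February, June) per condition.
ratings = {
    "students only":    ([4.0, 3.5, 4.5], [4.0, 3.0, 4.5]),
    "supervisors only": ([4.0, 3.5, 4.5], [3.0, 3.0, 4.0]),
    "both":             ([4.0, 3.5, 4.5], [3.5, 3.5, 4.5]),
    "no feedback":      ([4.0, 3.5, 4.5], [3.5, 3.0, 4.0]),
}

changes = {cond: mean_change(feb, jun) for cond, (feb, jun) in ratings.items()}

# Every condition declines, so each decline is compared with the control's.
baseline = changes["no feedback"]
relative = {cond: change - baseline for cond, change in changes.items()}

for cond in ratings:
    print(f"{cond:17s} change {changes[cond]:+.2f}  vs. control {relative[cond]:+.2f}")
```

A positive value in the last column means the condition declined less than the control group; a negative value means it declined more.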

The only feedback that had a positive effect, that is, that 
minimized the negative effect, was feedback from students. Feedback 
from supervisors, even though it was the same feedback as from 
students, moved the teachers in the opposite direction to that 
advocated by the feedback. If the supervisor said you are not fair 
enough, for example, the teacher became less fair. If the supervisor 
said you are not efficient enough, the teacher became less efficient. 
In each case the supervisory feedback caused the teacher to change 
in the opposite direction, whereas the student feedback was followed. 
In other words, the teachers changed in the direction advocated by 
student feedback and in the opposite direction to that advocated by 
supervisors (even though both gave identical feedback). And when 
you gave them feedback from both students and supervisors, the result 
was pretty much the same as from students alone. The supervisory 
feedback in this case has little effect. 

This finding led me to become very concerned about the source of 
feedback per se, particularly since the supervisor plays an evaluation 
role, and evaluation, because of its personal, career relevance, can be 
very threatening. However, I am not interested in evaluation; I am 
interested in feedback. I am interested in getting people to change 
based on their own inherent motivation. It is an internal process; 
feedback information is not for publication. The studies I have described 
were feedback studies, not evaluation studies. Teachers were assured 
that data would not be put into their files, i.e., that nobody would 
have access to the data. And yet, the data lead me to believe that the 
supervisor is viewed as an evaluator. Because it is very difficult for 
the teacher to separate the supervisor's role as an evaluator on the 
one hand, and as a source of non-threatening feedback on the other, it 
would be very difficult to make the supervisor part of the feedback 
process. 

Data concerning the years of experience of a teacher were examined 
on the hunch that there might be a difference between more experienced 
and less experienced teachers vis-a-vis their willingness to change. 
We did find effects that were not strong enough to be called significant 
but strong enough to bear repeating. The tendency we observed for 
the teachers with the least experience was to be most resistant to 
supervisor feedback, and the teachers with the most experience to be 
most resistant to student feedback. That is, the younger teachers were 
more receptive to student feedback, and the older teachers were 
receptive to supervisor feedback. This is interesting because the older 
teachers (older meaning eleven or more years of experience) were tenured 
and thus had no great threat associated with the supervisor. For these 
teachers, the supervisor was probably viewed as giving feedback and 
not doing an evaluation. This confirms my earlier supposition about 
what is happening when the supervisor reacts to the teacher. 

IV. The Change Process: Rules of Effective Feedback 

It may seem presumptuous to see someone attempt to explain the 
feedback process based on two experiments, but nevertheless I will 
try. Being in somewhat of a hurry to get out into the real world 
to actually see what kinds of changes can be produced (and having 
devoted much time to these experiments), I attempted to put together 
what I consider to be the twelve rules of effective feedback. These are 
shown in Figure 7. Let us consider each in turn. 

The first of these twelve rules of effective feedback is that 
feedback must involve concrete behaviors or characteristics. If you 
want to talk about things that a teacher can understand and relate to, 
you have to make the feedback as concrete as you can. That is why 
numbers (i.e., quantifications) help. If you talk about this much of a 
quality now versus that much of the quality then, it becomes easier to 
communicate the information. Or, alternatively, you can say you think 
this much of the quality is good but you only have that much of the 
quality. You can bring the feedback to bear much more easily if it is 
concrete. 

Secondly, the feedback must provide clear, incontrovertible 
evidence of exactly how you appear to behave. After it is given, if 
the teacher can say that that is your opinion, then it is not 
incontrovertible. It is important, therefore, to think in terms of a feedback 
system where the evidence is strong and compelling in order that it 
be accepted by teachers. 

The third rule is that the feedback source must be reputable and 
believable and his or her intentions accepted. To a large extent this 
may eliminate supervisors. I do not think that teachers question 
their reputability and believability as much as they do their intentions. 
I think there may be a great limitation upon supervisors within the 
feedback process; this is not to say that supervisors cannot play a 
role in the feedback process but the issue of intentionality must be 
dealt with. 

The fourth rule is that feedback must be in terms that the teacher 
can understand and relate to. One of the problems with the Flanders 
system is that the ratios, for example, are not easily understandable. 
Teachers cannot (as shown in the first study) behaviorally change these 
ratios. After all, who can keep in mind the three terms of the 
numerator and the one term of the denominator and change the three of the 
numerator up and the one of the denominator down all at the same time? 
It is just too complicated and thus not likely to happen. 

The fifth rule is that the feedback recipient must have a clear 
ideal model of what his behavior or characteristics should be. If 
we are going to try to motivate teachers by creating some state of 
dissonance or discrepancy between the way they are perceived and the 
way they want to be, we must make sure that they are clear about the 
way they think they should behave. 

The feedback recipient must also know what others' expectations 
of him are (Rule 6). I think that that is an important factor in 
formulating a personal model about what kind of a teacher you want to 
be. In other words, what kind of a teacher you want to be must be 
based in part on what kind of a teacher people expect you to be. 
Obviously, you cannot behave in a way that meets students' judgments 
unless you know what they expect of you. If your peers are going to 
provide you with feedback, you must know their expectations. If 
your supervisors are going to provide you with feedback, then you must 
know their expectations. 

The seventh rule is that you must make a commitment as to the 
way you would like to be. There must be a commitment in this system 
somewhere, otherwise you can weasel out. You must say at some point 
or another, "This is what I want to do. I don't want to talk so much. 
Talking so much is not good teaching." That is a commitment. It is 
like Weight-Watchers, or Smoke-Enders, where your commitment is partly 
based on the money you pay. For $70 you might give up almost anything. 

The eighth rule is that you must also make a public commitment to 
change (another similarity to the procedures used by Weight-Watchers 
and others). You cannot mumble this commitment under your breath so 
that nobody hears it, because if you do, it may be the kind of a 
commitment that you give up when the going gets rough. It must be public. 

The ninth rule is that the feedback must create tension. That is, 
it must be dissonant with your self-perceptions or ideals and it must 
be internalized. This gets back to the idea of motivation. If you 
think something about yourself and you get feedback that confirms it, 
you will not change, and appropriately, you should not change - there 
is no tension. If you want people to change, you have to find out 
ways of giving them feedback that is inconsistent or dissonant with 
the way they see themselves, thereby creating the tension for change. 
That may mean having a fairly flexible feedback system - something to 
keep in mind. 

The tenth rule is that the reception of feedback must not involve 
more than low risk, i.e., support should be provided. This is very 
important. Feedback, in any aspect of life - professional, avocational, 
etc., is not easy to accept. It is not easy for people to tell others 
how they see them and it is not easy to hear it, especially if it is 
not consistent with the way someone sees himself. Since feedback is 
something that does have a degree of inherent threat, one of the rules 
must be that at the same time you give feedback, you must provide some 
kind of support. 

The eleventh rule is that models for change and for the support 
of change must be provided. A feedback system must be part of a 
model, that is, it must relate to other aspects of teaching behavior, 
and there must be the possibility to generalize from it. If a feedback 
system does not provide the possibility to generalize from it, the 
kinds of changes produced may be very finite and limited, as opposed 
to actually producing major changes in a teacher's teaching philosophy. 
In other words, a feedback system must deal with teaching philosophy. 

And finally, the twelfth rule of feedback is that accountability 
(by now one of your favorite words) to your group must be maintained 
through continuing feedback. When I say accountability I mean 
accountability to the people who are providing you with the feedback. And 
in the model that I will advocate (later on), the people who will provide 
the feedback will be other teachers. In accountability, as I see it, 
you make a public commitment to your peers that you will attempt to 
accept and use their feedback, and in turn have accountability to provide 
them with feedback too. This is a kind of accountability that I believe 
can be lived with. It is accountability based on the fact that you are 
asking people to help you, to contribute to your growth and development; 
therefore, you have a responsibility to give this feedback a serious 
try. 

V. The Change Environment

Since feedback as an element for change occurs in a total environment,
I would like to talk for a moment about what I call the change
environment. The change environment is a critical component of change.
Nothing will happen unless the environment has those characteristics
that contribute to the change process. The components of the change
environment that I identified are shown in Figure 8. The change
environment, first of all, must have newness. If you are going to change,
you obviously have to have something to change to; some "innovation." And
if you have to change to be doing it, then that "something" for you is
new. Be it accountability or behavioral objectives, they are new;
feedback from peers is new; team teaching may be new for you or for your
system; non-grading, differentiated staffing, and so on. All I am saying
is that a critical element of the change environment is having something
to change to which will be new for the potential adopter.

The change environment must also contain the element of compelling
reality. This is unfortunately a "negative" aspect of the change
environment, but it has to be present. This is the "shotgun." This is the
father who comes rapping on the door of the young man who just left the
hay loft with his daughter and delivers an ultimatum. That is compelling
reality and it will motivate that young man to the altar in
some short space of time. As negative as it might seem, if you want
to affect behavior, there has to be some kind of constraining or
compelling reality. Threatening to burn down the school, for instance,
is in some ways a very effective compelling reality. There is no way
to get around the fact that that is going to get your blood flowing.
It produces the kind of threat that does unfortunately seem to
contribute to the change environment. If everything is nice, happy,
pat-on-the-back, we-are-all-in-this-together,
we-are-going-to-make-better-schools, then in my perception, nothing
happens. Compellingness has to be produced by someone, be it the board
of education, the superintendent, the principal, a subgroup of teachers,
the parents, or the students; someone has to hold a shotgun to the group
that they are trying to change. That is the compelling reality.

The third element is called open participation, that is, an honest
opportunity to contribute to the change decision, and an honest
willingness to be a part of the process of change. In the case of teacher
feedback, this means saying: "I want to know how you perceived me and
I am willing to tell you how I perceive you." And that kind of open
participation represents a risk, and there is no way to finesse that
point. It does not matter who you are; open participation is risky.

The first three elements - newness, compelling reality, and open
participation - all represent risks of a sort and might be considered
negative elements in some sense. On the more positive side are the
last two elements, both of which help us live with this risk and be
willing to take it. The first of these is a problem focus, that is,
the realization that what is called for is problem solving. We must
realize that the tools and the skills of problem solving can be
brought to bear in dealing with the problem that is prompting the
change (and be allowed to use these tools and skills). In other
words, we are rational people to some degree and our problems can be
solved rationally. That focus is a critical part of the change
process.

Finally, we have the element of support. Risk reduction and
the maintenance of the entire system are dependent on what I call
the group-and-leader. It may be an informal group like the wildcat
strikers who are meeting in the basement of somebody's house to plan
their next strategy or a board of education caucus, or it may be a
formal group like the entire board of education or the teachers'
union. At some point within the change process, there are groups
that form and leadership that emerges, and these groups are a source
of support. You can lean on them when things get rough. However,
these same things can also be an oppositional force to the change
process; they can provide the greatest resistance to change by using
their support mechanism to avoid it. When that happens, the group is
beginning to deny open participation. As soon as the group denies open
participation, one of the elements of the model disappears and therefore
change is not going to occur.

What I am saying is that you need all five elements of the change
environment for change to occur. You can't have four of them, or three
of them; you need all five of them. The newness, the innovation,
produces the challenge (or you may call it threat). Compelling reality,
the burning building, the subtle edict or whatever, provides the
confrontation. Open participation allows for specific feedback. The
problem focus creates a problem-solving orientation, and the
group-and-leader provide support for risk reduction. Taken together, these
factors create in people a willingness to experiment, that is, to try
things out, a willingness to show themselves, and a willingness to be
receptive to others. And these are ultimately the major ingredients
of learning and growth. This is perhaps a somewhat abstract and
idealistic conception of change but it does provide a reasonable point of
departure into the specific mechanisms for changing teacher behavior
through feedback.

VI. The Tuckman Teacher Feedback System

Let us move on to the last step by putting together the data and
intuitions from the two experiments along with some of the more general
concepts that I evolved from them. Let us consider a feedback
system that hopefully would become part of the larger educational
system and help teachers to change their own behavior. I designed a
form for this purpose which I called the Tuckman Teacher Feedback Form
(or TTFF). (I figured that if Flanders could have a Flanders
Interaction Analysis Form, then Tuckman could have a Tuckman Teacher
Feedback Form. Certainly, nobody else was going to call their form the
Tuckman Teacher Feedback Form.)

The TTFF began as a rather long laundry list of adjectives each
of which somehow seemed to describe a human element in behavior and
each of which was paired with an opposite, e.g., original-conventional,
passionate-controlled, impertinent-polite, patient-impatient, cold-warm,
initiating-deferrent, and so on. I purposely tried to use adjectives
that describe the human element in teaching. It seems that we have
many other ways to evaluate, or provide feedback about, the curriculum, and
there are many ways to provide feedback or accountability in terms of
student performance. But after all is considered, the teacher as a
human being still has the human element as a unique quality of teaching
and most of the existing feedback or evaluation systems make no attempt
to assess it. I am not presumptuous enough to think that I know how a
teacher should be on these human elements (although I think if we were to
discuss it for even a few minutes we would reach a high degree of agreement).
The point is that in my system every teacher has the right to specify what
he thinks the good teacher should be, that is, set his own goal, and work
toward it. Moreover, the feedback is referenced only in terms of his
own goal. So I am not imposing an arbitrary standard on teachers but
attempting to introduce or reintroduce the human element into
teaching.

As I said, I began with this long laundry list of adjective pairs
that I more-or-less picked out of the air. And when I thought that I
had a long enough list, I recruited 80 of my students who were also
teachers, administrators, or full-time graduate students at Rutgers and
asked them to use these adjective pairs to rate one of their graduate
instructors. I used a statistical procedure called factor analysis to
analyze the data they provided. Factor analysis is a procedure that
allows you to tell numerically or quantitatively when different things
apparently mean the same thing to the same person. In other words, I can
use the adjectives "original" and "creative", and mean different things
by them. Factor analysis can tell you the extent to which people in fact
mean the same thing by these two terms by determining whether they use
them in the same way when they are judging someone such as a teacher.
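
The notion that two adjectives "mean the same thing" to a group of judges can be made concrete with a correlation coefficient, which is the raw material a factor analysis works from. What follows is only a minimal sketch of that idea, not the analysis actually run on the TTFF data: the ratings are invented for illustration, and the helper name `pearson` is a hypothetical choice.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented 7-point ratings of one instructor by six judges (illustration only).
original = [6, 5, 7, 3, 4, 6]
creative = [6, 4, 7, 2, 4, 5]
patient = [3, 6, 2, 5, 6, 4]

# Strongly positive: the judges use "original" and "creative" in the same way.
r_same = pearson(original, creative)
# Negative in this invented data: "patient" is not used like "original".
r_diff = pearson(original, patient)
```

Adjective scales that correlate highly with one another across judges are the ones a factor analysis would tend to group into a single factor; the grouping itself requires a full factor-analytic procedure that this sketch does not attempt.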

(The scale or form I came up with using this procedure, the TTFF, appears in
the Appendix.) Remember that I made no attempt to be systematic in selecting
these 75 adjective pairs to begin with; however, people just do not use 75
pieces of information to describe teaching behavior (or any kind of behavior
for that matter). It is much too many. The factor analysis reduced the
75 adjective pairs to four factors. There were just four factors or
clusters of meaning in this whole laundry list. Each is shown in Figure 9,
and will be briefly described below.

The first factor I called creativity. The teacher who was creative
was imaginative, experimenting, original, iconoclastic, uninhibited,
adventurous, flexible and initiating, in contrast to the noncreative
teacher who was routinized, exacting, cautious, conventional, ritualistic,
inhibited, timid, dogmatic, and deferrent. Those pairs of words meant the
same to the student judges (as evidenced by the factor loadings in the
factor analysis), and I chose the term "creativity" as a way of trying to
label what those words seemed to have in common. Thus, the student judges
first reacted to the creativeness of a teacher, and seemed to do so in very
personal terms.

I had a little more trouble naming the second factor. I called it
dynamism. It seemed to me to be a combination of dominance and energy,
and so I called it dynamism because I did not want to use a word that
conveyed just energy. Dynamism has within it (according to the analysis)
buoyant, extraverted, bubbly, and outspoken, all of which seem to refer to
a teacher's energy level. This factor also includes aggressive, assertive,
dominant, and direct, which are dominance terms. It seems to mix together
two qualities that I had thought were separable but were not distinguished
by the judges.

The third factor I called organized demeanor; again I used a
somewhat obtuse label rather than simply calling it organization. I
did this because it includes more than just terms that refer to
organization. True, it includes systematic, organized, purposeful,
resourceful and knowledgeable on the one hand, but it also includes in control,
sophisticated, observant, and conscientious on the other hand. It is
more than organization; it is organization plus self-control.

Finally, the fourth factor I called warmth and acceptance. That
describes the warm, sociable, amiable, patient, fair, gentle, accepting,
thoughtful, polite teacher as opposed to the cold, unfriendly, hostile,
impatient, unfair, harsh, critical, inconsiderate, impertinent teacher.
(Each of these factors could have included other words had I introduced
other words into the laundry list.)

There is a specific scoring procedure for the TTFF based on a
scoring form (which I have included in the Appendix). Not all of the
items on a factor are included in the scoring, in order to avoid
unnecessary redundancy. Also, the adjective pairs are written in both
directions. (As you can see on the TTFF, some have their "positive"
end on the left, some on the right.) This is just a good measurement
strategy. If you put the positive adjectives on the left all the time
and the negative ones on the right, someone can fall asleep and mark
the left end on each and you turn out to be the greatest teacher in the
world. Since no one wants falling asleep to be a factor, the items are
written in both directions. As a result, when the TTFF is scored, either
the positive or the negative items must be turned around or scored separately.
When that is done, a constant must be added so that the lowest score a
person can get will be "1". This is done because it is much easier to deal
with positive numbers than negative ones. The scoring procedure looks more
complicated than it really is. The scoring form is quite explicit and, when
used, makes the scoring process quite mechanical. Once the scoring is
completed, the profile of the teacher on the four dimensions is plotted
at the bottom of the scoring form so he can see how he has been judged. The
resulting line between points provides the teacher with a basis to react to
himself.
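
The arithmetic behind one factor score can be sketched as follows. The item numbers are those printed on the Creativity line of the Feedback Summary Sheet in the Appendix. Everything else is an assumption made for illustration: that each mark is scored from 7 (space nearest the left-hand adjective) down to 1, which is at least consistent with the constant of 18 printed on the sheet; the function names; and the sample ratings.

```python
SCALE_MIN, SCALE_MAX = 1, 7  # assumed: leftmost space scores 7, rightmost scores 1

# Item numbers copied from the Creativity line of the summary sheet.
CREATIVITY_PLUS = [1, 9, 11, 24]    # "positive" adjective printed on the left
CREATIVITY_MINUS = [10, 18, 47]     # "negative" adjective printed on the left

def constant_for(plus_items, minus_items):
    """Constant that makes the lowest possible factor score equal 1."""
    lowest_raw = len(plus_items) * SCALE_MIN - len(minus_items) * SCALE_MAX
    return 1 - lowest_raw

def factor_score(ratings, plus_items, minus_items):
    """Sum the positively keyed items, subtract the reversed ones, add the constant."""
    plus = sum(ratings[i] for i in plus_items)
    minus = sum(ratings[i] for i in minus_items)
    return plus - minus + constant_for(plus_items, minus_items)

# Hypothetical ratings: item number -> scored value of the observer's mark.
ratings = {1: 6, 9: 5, 11: 4, 24: 6, 10: 2, 18: 3, 47: 1}
creativity = factor_score(ratings, CREATIVITY_PLUS, CREATIVITY_MINUS)
```

Under these assumptions a factor with four positively keyed and three reversed items has a lowest raw value of 4 - 21 = -17, so a constant of 18 brings the minimum to 1, in line with the rule stated above.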

The steps in the total feedback system that I am proposing are shown
in Figure 10 and described below. The first thing I would do in the
feedback system is ask the teachers to fill out the TTFF describing "The Good
Teacher." They may be describing themselves; I am not quite sure whether
that matters. I am not willing to say at this point that the higher you
are on these four dimensions the better a teacher you are. It may not be
that simple. I would rather the criteria be what you yourself think a good
teacher is, or what we agree consensually that a good teacher should be.
Remember that nobody is being forced to change by this system so the more
points of reference there are the better. Remember also that the basis
for change is to be dissonance - dissonance between what you are and what you
think you are. So I would begin by having a group of teachers fill it out
on "The Good Teacher." Six or seven teachers within a school might be
involved and each would be asked to fill it out on his own.

Then, the teachers would be given the opportunity to observe one
another. This can be done by sitting in on one another's classrooms, or,
if the facility exists, using closed-circuit television or video tape.
Regardless of how it is done, the fact remains that you cannot judge a
teacher's behavior unless you observe it, whatever the inconvenience. When
a teacher is out of his own room you have to bring in a substitute. The
process, as you will see, may also require some in-service time.

Then each teacher is given a consensus summary statement of ratings
of him by the other teachers in the group, so he knows how his teaching
behavior is perceived. At the same time, the teachers involved meet as a
group to discuss the feedback. This is done so that the feedback is not
conveyed by just an impersonal sheet that you find in your cubby-hole
mailbox one day. It is not meant to work that way. The feedback is given
in conjunction with group process.
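
One way to picture the consensus summary and the dissonance it is meant to expose is sketched below. This is an illustration under stated assumptions rather than the procedure used with the TTFF: the consensus is taken to be a simple mean of the observers' factor scores, the discrepancy is taken to be "The Good Teacher" score minus the consensus score, and all names and numbers are hypothetical.

```python
from statistics import mean

DIMENSIONS = ["creativity", "dynamism", "organized_demeanor", "warmth_acceptance"]

def consensus(peer_profiles):
    """Average each dimension over the peers who observed the teacher."""
    return {d: mean(p[d] for p in peer_profiles) for d in DIMENSIONS}

def discrepancy(ideal, perceived):
    """Ideal minus perceived, per dimension; positive means short of one's own goal."""
    return {d: ideal[d] - perceived[d] for d in DIMENSIONS}

# Hypothetical factor scores, for illustration only.
ideal = {"creativity": 36, "dynamism": 30,
         "organized_demeanor": 40, "warmth_acceptance": 40}
peers = [
    {"creativity": 30, "dynamism": 22, "organized_demeanor": 39, "warmth_acceptance": 38},
    {"creativity": 34, "dynamism": 20, "organized_demeanor": 41, "warmth_acceptance": 42},
    {"creativity": 32, "dynamism": 24, "organized_demeanor": 40, "warmth_acceptance": 40},
]

summary = consensus(peers)            # how the group sees the teacher
gaps = discrepancy(ideal, summary)    # where that differs from the teacher's own goal
largest_gap = max(gaps, key=gaps.get) # a candidate focus for the group discussion
```

In this invented example the teacher's own goal and the group's perception agree on organized demeanor and warmth, and differ most on dynamism, which is the kind of discrepancy the group meeting would then take up.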

In the next step, the teachers engage in what I call strength
training (for want of a better term, and since somebody has already coined
it that). Now that you see from the feedback form what your deficiencies
are, you ask yourself what you can actually do in the classroom to overcome
them. In strength training you learn how to create new strengths for
yourself. You do this by discussing your deficiencies with one another and
giving one another specific ideas about how to convert them into strengths.
The teachers can even role play these new strength techniques on one another.
At the same time, they try out these new strength techniques in their
regular classes. Take dynamism, for example. If the teacher is not seen
as being as energetic as he or she would like to be, the other teachers in
the group might point out certain things about movement and modulation
of voice and activity level that might make strengths out of these weaknesses.
The teacher can then try these things out in her actual classes.

And finally, the teachers then observe one another a second time to
provide a basis for determining whether there has been a change in behavior
in the recommended direction.

VII. Summary 

This paper has covered a lot of ground. It began with a model of 
teacher behavior that linked the teacher to his own behavior through
awareness based on feedback. Two studies followed that showed that teachers
would change their behavior based on feedback information telling them 
how they were perceived. These studies also indicated that dissonance 
between self-perceptions and the perceptions of others was a motivator of 
change, and that supervisors, traditional sources of "feedback" to teachers, 
had little effect. 

Based on these studies, 12 rules of feedback were presented as a
kind of operational philosophy of changing teacher behavior. These rules
were further generalized to provide a conception of the change environment -
those conditions that must exist for change to occur. Finally, the feedback
rules and the change environment characteristics were incorporated into a
total teacher feedback system (which I named after myself) which incorporated
a feedback form and scoring system designed and analyzed for the purpose of
providing teachers with the kind of information about themselves on which
change could be based. The instrumentation was further nested in the group
process to provide the mechanisms for change required by the change
environment. The obvious next step is to try it out. This is now in process.

TEACHER TALK

INDIRECT INFLUENCE

1. *ACCEPTS FEELING: accepts and clarifies the feeling tone of the
   students in a non-threatening manner. Feelings may be positive or
   negative. Predicting and recalling feelings are included.

2. *PRAISES OR ENCOURAGES: praises or encourages student action or
   behavior. Jokes that release tension, not at the expense of another
   individual, nodding head or saying "uh huh?" or "go on" are included.

3. *ACCEPTS OR USES IDEAS OF STUDENT: clarifying, building, or
   developing ideas or suggestions by a student. As teacher brings
   more of his own ideas into play, shift to category five.

4. *ASKS QUESTIONS: asking a question about content or procedure with
   the intent that a student answer.

DIRECT INFLUENCE

5. *LECTURES: giving facts or opinions about content or procedures;
   expressing his own idea; asking rhetorical questions.

6. *GIVES DIRECTIONS: directions, commands, or orders with which a
   student is expected to comply.

7. *CRITICIZES OR JUSTIFIES AUTHORITY: statements intended to change
   student behavior from non-acceptable to acceptable pattern; bawling
   someone out; stating why the teacher is doing what he is doing,
   extreme self-reference.

STUDENT TALK

8. *STUDENT TALK-RESPONSE: talk by students in response to teacher.
   Teacher initiates the contact or solicits student statement.

9. *STUDENT TALK-INITIATION: talk by students, which they initiate.
   If "calling on" student is only to indicate who may talk next,
   observer must decide whether student wanted to talk. If he did,
   use this category.

10. *SILENCE OR CONFUSION: pauses, short periods of silence, and
    periods of confusion in which communication cannot be understood
    by the observer.

Figure 2. Summary of Categories for
Interaction Analysis
(Flanders, 1965)



DISSONANCE THEORY
Beliefs about ourselves and our own behavior are potentially 
dissonant if we behave in ways that are discrepant from or 
opposite to the ways we believe we should or do behave. When 
we are made aware of this discrepancy (or consciously create it) , 
dissonance is produced. 

The amount of dissonance produced is a function of (A) the 
importance or centrality of the self-beliefs in question, (B) 
the extent to which the evidence of the discrepant behavior is 
incontrovertible and (C) the magnitude of the discrepancy between 
belief and behavior. 

The presence of dissonance gives rise to pressures to reduce it
(proportional to its amount) because it is unpleasant to
experience.

Dissonance can be reduced by (A) changing our beliefs or perceptions
of ourselves to bring them more in line with our behavior,
(B) changing our behavior to bring it more in line with our
beliefs, (C) finding other evidence of our behavior which is
more consistent with our beliefs, or (D) otherwise rationalizing
or compartmentalizing the two so that nothing need change (such
as negating the legitimacy and accuracy of the evidence about
our behavior).

People can tolerate some degree of dissonance without changing 
but when the dissonance reaches a critical level, something must 
change. 

Figure 3

Analysis of Mean Change Scores:

DISCREPANCY SCORE CHANGE (TOTAL)

                    Verbal      Interaction   Tape
                    Feedback    Analysis      Recording   Control
High Discrepant        91           61           40          30
Low Discrepant         33           -1            2           1
Mean                   62           30           21          15

.01

SELF-PERCEPTION CHANGE (INDIRECT RATIO)

                    Verbal      Interaction   Tape
                    Feedback    Analysis      Recording   Control
High Discrepant        51           30           16          13
Low Discrepant         15           -7            6           4
Mean                   33                        11           9

BEHAVIOR CHANGE (TEACHER TALK)

                    Verbal      Interaction   Tape
                    Feedback    Analysis      Recording   Control
High Discrepant         9            2            0          -5
Low Discrepant          4           -2           -4          -6
Mean                    7            0           -2          -6

.05





Figure 4. Changing Teacher Behavior Through
Dissonance and Different Forms of
Feedback. (From Tuckman, McCall,
and Hyman, 1969.)

Figure 5. The Student-Opinion Questionnaire
(Bryan, 1963, p. 53).

Please answer the following questions honestly and frankly. Do not give your name.
To encourage you to be frank, your regular teacher should be absent from the classroom while these questions are
being answered. Neither your teacher nor anyone else at your school will ever see your answers.

The person who is temporarily in charge of your class will, during this period, collect all reports
and seal them in an envelope addressed to the University. Your teacher will receive
from the university a summary of the answers by the students in your class. The University will mail
this summary to no one except your teacher unless requested to do so by your teacher.

After completing this report, sit quietly or study until all students have completed their reports.
There should be no talking.

Underline your answer to each question on this page. Write your answers to questions 11 to 14 on
the other side of this page.

WHAT IS YOUR OPINION CONCERNING:

1. THE KNOWLEDGE THIS TEACHER HAS OF THE SUBJECT TAUGHT?
(Has he a thorough knowledge and understanding of his teaching field?)

Below Average    Average    Good    Very Good    The Very Best

2. THE ABILITY OF THIS TEACHER TO EXPLAIN CLEARLY?
(Are assignments and explanations clear and definite?)

Below Average    Average    Good    Very Good    The Very Best

3. THIS TEACHER'S FAIRNESS IN DEALING WITH STUDENTS?
(Is he fair and impartial in treatment of all students?)

Below Average    Average    Good    Very Good    The Very Best

4. THE ABILITY OF THIS TEACHER TO MAINTAIN GOOD DISCIPLINE?
(Does he keep good control of the class without being harsh? Is he firm but fair?)

Below Average    Average    Good    Very Good    The Very Best

5. THE SYMPATHETIC UNDERSTANDING SHOWN BY THIS TEACHER?
(Is he patient, friendly, considerate, and helpful?)

Below Average    Average    Good    Very Good    The Very Best

6. HOW MUCH YOU ARE LEARNING IN THIS CLASS?
(Are you learning well and much? Are you really working?)

Below Average    Average    Good    Very Good    The Very Best

7. THE ABILITY THIS TEACHER HAS TO MAKE CLASSES INTERESTING?
(Does he show enthusiasm and a sense of humor? Does he vary teaching procedures?)

Below Average    Average    Good    Very Good    The Very Best

8. THE ABILITY OF THIS TEACHER TO GET THINGS DONE IN AN EFFICIENT AND
BUSINESS-LIKE MANNER?
(Are plans well made? Is little time wasted?)

Below Average    Average    Good    Very Good    The Very Best

9. THE SKILL THIS TEACHER HAS TO GET STUDENTS TO THINK FOR THEMSELVES?
(Are students' ideas and opinions worth something in this class? Do students help decide how to solve
problems or how to get their work done? Do they get at the real reasons why certain things happen?)

Below Average    Average    Good    Very Good    The Very Best

10. THE GENERAL (ALL-ROUND) TEACHING ABILITY OF THIS TEACHER?
(All factors considered, how close does this teacher come to your ideal?)

Below Average    Average    Good    Very Good    The Very Best



(over)

Figure 5. (Con't.)

11. PLEASE NAME ONE OR TWO THINGS THAT YOU ESPECIALLY LIKE ABOUT THIS TEACHER.



12. PLEASE GIVE ONE OR TWO SUGGESTIONS FOR THE IMPROVEMENT OF THIS TEACHER.



13. PLEASE NAME ONE OR TWO THINGS THAT YOU ESPECIALLY LIKE ABOUT THIS COURSE.



14. PLEASE GIVE ONE OR TWO SUGGESTIONS FOR THE IMPROVEMENT OF THIS COURSE.



Prepared by the Student Reaction Center, Division of Field Services, Western Michigan University, Kalamazoo, Michigan.

MEAN TOTAL CHANGE SCORES BY FEEDBACK CONDITION AND
THEIR COMPARISON BY DUNCAN MULTIPLE RANGE TEST

Students      Students and     Supervisors     No
Only          Supervisors      Only            Feedback

-0.05         -0.39            -2.45*          -1.23*

*Significantly different from all other means, p < .01 (with exception
of difference between second and fourth means where p < .05).

MEAN TOTAL CHANGE SCORES BY YEARS OF TEACHING EXPERIENCE
AND SOURCES OF FEEDBACK (STUDENT VS. SUPERVISOR) AND
THEIR COMPARISON BY DUNCAN MULTIPLE RANGE TEST

Years of Experience                 1 - 3      4 - 10     11 or more

Student Feedback                    +0.04      -0.03      -0.67*

Supervisor Feedback                 -1.89*     -1.11      -1.22

Mean (all 4 feedback conditions)    -1.11      -0.76      -1.17

*Significantly different from other means for that feedback condition
(p < .10).



Figure 6. Changing Teacher Behavior as a Function of Feedback
Source and Teachers' Experience Level. (From
Tuckman and Oliver, 1968.)

Figure 7.
12 RULES OF EFFECTIVE FEEDBACK

(1) Feedback must involve concrete behaviors or characteristics.

(2) Feedback must provide clear, incontrovertible evidence of
exactly how you appear to behave.

(3) Feedback source must be reputable and believable and intentions
accepted.

(4) Feedback must be in terms you can understand and relate to.

(5) You, the feedback recipient, must have a clear ideal model of
what your behaviors or characteristics should be.

(6) You, the feedback recipient, must also know what others'
expectations of you are.

(7) You must make a commitment as to the way you would like to be.

(8) You must also make a public commitment to change.

(9) Feedback must create tension - it must be dissonant with your
self-perceptions or ideals and it must be internalized.

(10) Reception of feedback must not involve more than low risk (i.e.,
support should be provided).

(11) Models for change and support for change must be provided.

(12) Accountability to your group must be maintained through
continuing feedback.

FACTOR 1
CREATIVITY

Creative-Routinized
Imaginative-Exacting
Experimenting-Cautious
Original-Conventional
Iconoclastic-Ritualistic
Uninhibited-Inhibited
Adventurous-Timid
Flexible-Dogmatic
Initiating-Deferrent

(.84) (.83) (illegible) (.72) (.66) (.66) (.59) (.52)

FACTOR 2
DYNAMISM

Outgoing-Withdrawn
Outspoken-Reserved
Bubbly-Quiet
Extroverted-Introverted
Aggressive-Passive
Assertive-Soft-Spoken
Dominant-Submissive
Direct-Subtle
Buoyant-Lethargic

FACTOR 3
ORGANIZED DEMEANOR

Systematic-Erratic         (.83)
Organized-Disorganized     (.76)
Purposeful-Capricious      (.74)
Conscientious-Flighty      (.71)
In Control-On The Run      (.62)
Observant-Preoccupied      (.58)
Resourceful-Uncertain      (.55)
Sophisticated-Naive        (.54)
Knowledgeable-Shallow      (.54)

FACTOR 4
WARMTH AND ACCEPTANCE

Warm-Cold
Sociable-Unfriendly
Amiable-Hostile
Patient-Impatient
Fair-Unfair
Gentle-Harsh
Accepting (People)-Critical
Thoughtful-Inconsiderate
Polite-Impertinent

(.76) (.74) (.79) (.69) (.67) (.65) (.64)

Figure 9. Results of the Factor Analysis of the TTFF.

(Numbers in parentheses represent factor loadings;
N = 84 teacher trainees)

Figure 10.
7 STEP TEACHER FEEDBACK SCHEDULE

(1) Teachers fill out ideal TTFF

(2) Teachers observe one another and fill out TTFF*

(3) Each teacher receives consensus summary statement

(4) Teachers meet as group to discuss feedback 

(5) Teachers engage in strength training 

(6) Teachers apply "Strengths" in regular classes 

(7) Teachers observe one another again and share feedback 



* Student judges may be used in place of teacher judges in this step.

Observer
Date

TUCKMAN TEACHER FEEDBACK FORM
FORM A

On the following pages you will find 50 rating scales similar to
the one shown below.

TALL ___ : ___ : ___ : ___ : ___ : ___ : ___ SHORT

You are to use all 50 scales to rate the teacher that you are
observing. If you feel that the adjective tall very accurately describes
the teacher, place an X in the space next to tall, as shown below.

TALL _X_ : ___ : ___ : ___ : ___ : ___ : ___ SHORT

If you feel that the adjective tall is somewhat descriptive of
the teacher you are observing, place an X in the second space; if
slightly descriptive, place an X in the third space.

If you feel that the adjective short very accurately describes the
teacher you are observing, place an X in the space next to short, as
shown below.

TALL ___ : ___ : ___ : ___ : ___ : ___ : _X_ SHORT

If you feel that the adjective short is somewhat descriptive, place
an X in the second to last space; if slightly descriptive, place an X
in the third space from the right.

If you feel that either adjective is equally appropriate (or
non-appropriate), place an X in the center space.

Do not place X's anywhere but in one of the seven spaces provided.
Make only one X on each scale. Do not leave any blank, do not mark any 
more than once. 

This scale will help a teacher become aware of how others see him 
(her). This form of feedback is essential for self- improvement. Try 
to be both objective and candid. 



Teacher 
Observed 



 1. ORIGINAL              CONVENTIONAL
 2. PASSIONATE            CONTROLLED
 3. IMPERTINENT           POLITE
 4. PATIENT               IMPATIENT
 5. COLD                  WARM
 6. INITIATING            DEFERRENT
 7. HOSTILE               AMIABLE
 8. LIKEABLE              ALOOF
 9. CREATIVE              ROUTINIZED
10. INHIBITED             UNINHIBITED
11. ICONOCLASTIC          RITUALISTIC
12. GENTLE                HARSH
13. UNFAIR                FAIR
14. BUOYANT               LETHARGIC
15. SHALLOW               KNOWLEDGEABLE
16. CAPRICIOUS            PURPOSEFUL
17. ENERGETIC             LIFELESS
18. CAUTIOUS              EXPERIMENTING
19. DISORGANIZED          ORGANIZED
20. THOUGHTFUL            INCONSIDERATE
21. UNFRIENDLY            SOCIABLE
22. RESOURCEFUL           UNCERTAIN
23. RESERVED              OUTSPOKEN
24. IMAGINATIVE           EXACTING
25. SUBTLE                DIRECT
26. ERRATIC               SYSTEMATIC
27. AGGRESSIVE            PASSIVE
28. CONCEITED             HUMBLE
29. ACCEPTING (people)    CRITICAL
30. DETACHED              EMPATHIC
31. QUIET                 BUBBLY
32. AUTOCRATIC            DEMOCRATIC
33. CONTEMPLATIVE         IMPULSIVE
34. OUTGOING              WITHDRAWN
35. STUBBORN              ACCOMMODATING
36. IN CONTROL            ON THE RUN
37. FLIGHTY               CONSCIENTIOUS
38. DOMINANT              SUBMISSIVE
39. (illegible)           CHEERFUL
40. OBSERVANT             PREOCCUPIED
41. EAGER                 DISDAINFUL
42. INTROVERTED           EXTROVERTED
43. RELAXED               NERVOUS
44. DOGMATIC              FLEXIBLE
45. ASSERTIVE             SOFT-SPOKEN
46. EASY GOING            DEMANDING
47. TIMID                 ADVENTUROUS
48. ANGRY                 HAPPY
49. DOMINEERING           PERMISSIVE
50. INDIFFERENT           RESPONSIVE



Check to make sure that you have not left any scale blank, nor have
marked more than one X on each scale.

TUCKMAN TEACHER FEEDBACK FORM
FEEDBACK SUMMARY SHEET

Teacher
Observed ____________    Observer ____________    Date ____________

[Item scoring illustration for the ORIGINAL and COLD scales; not recoverable]

I. Creativity

   item  item  item  item       item  item  item
   (  1  +  9  + 11  + 24 ) - (  10 +  18 +  47 ) + 18
   ( __  + __  + __  + __ ) - (  __ +  __ +  __ ) + 18 = ____

II. Dynamism (dominance & energy)

   item  item  item  item       item  item  item
   ( 27  + 34  + 38  + 45 ) - (  23 +  31 +  42 ) + 18
   ( __  + __  + __  + __ ) - (  __ +  __ +  __ ) + 18 = ____

III. Organized Demeanor (organization & control)

   item  item  item       item  item  item  item
   ( 22  + 36  + 40 ) - (  16 +  19 +  26 +  37 ) +
   ( __  + __  + __ ) - (  __ +  __ +  __ +  __ ) +

IV. Warmth and Acceptance

   item  item  item       item  item  item  item
   (  4  + 12  + 29 ) - (   5 +   7 +  13 +  21 ) +
   ( __  + __  + __ ) - (  __ +  __ +  __ +  __ ) +
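The arithmetic on the summary sheet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each of the 50 scales is assumed to have been converted to a number from 1 to 7 already (the sheet's own item-scoring key is not legible above, so that conversion is not shown), the names `DIMENSIONS`, `dimension_score`, and `summary` are hypothetical, and only dimensions I and II are included because the additive constants for III and IV do not survive on the sheet.

```python
# Sketch of the summary-sheet arithmetic for the two dimensions whose
# formulas are complete. Assumption: item_scores maps item number -> 1..7.
DIMENSIONS = {
    "Creativity": {"plus": (1, 9, 11, 24), "minus": (10, 18, 47), "constant": 18},
    "Dynamism": {"plus": (27, 34, 38, 45), "minus": (23, 31, 42), "constant": 18},
}


def dimension_score(item_scores, plus, minus, constant):
    """Sum of the 'plus' items, minus the sum of the 'minus' items, plus the constant."""
    added = sum(item_scores[i] for i in plus)
    subtracted = sum(item_scores[i] for i in minus)
    return added - subtracted + constant


def summary(item_scores):
    """Return a score for every dimension listed in DIMENSIONS."""
    return {
        name: dimension_score(item_scores, **spec)
        for name, spec in DIMENSIONS.items()
    }


# A rater who marks the center space (assumed to be scored 4) on every scale.
neutral = {item: 4 for item in range(1, 51)}
neutral_summary = summary(neutral)
```

With four added items, three subtracted items, and item scores between 1 and 7, the constant 18 keeps each of these two dimension scores between 1 and 43.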
INSTRUMENTS FOR STUDENT EVALUATION OF FACULTY: IDEAL AND ACTUAL

Alan L. Sockloff
Temple University

The subject of my talk is instruments for student evaluation of
faculty effectiveness. The reason I find it necessary to remind you of
the subject is that I expect you to wonder at times whether you are
listening to another talk from another conference. It is my belief that
the construction of good faculty evaluation instruments involves quite a
bit more than the gathering of a set of items from another unknown,
unproven instrument, putting these items together in a single evening,
and obtaining a dean's approval. For the construction of any instrument,
there should exist a substantive philosophy and a scientific methodology
as a basis. I will discuss some of the ingredients necessary for
conceptualizing education, as well as some of the methodological problems
that must be combatted in the construction of faculty evaluation instruments.

What are we trying to measure?

The primary difficulty in the construction of a faculty evaluation
instrument stems from the complexity of such an endeavor. Besides the many
sources of evaluation and the major purposes that can be served by the
evaluations, the fact that faculty responsibility covers a broad domain
suggests that there are also many facets of faculty activity that can be
evaluated. In a large industry in which the primary goal is the dollar,
the assembly line worker can be evaluated by a foreman via the application
of a single numeric measure of his productivity, i.e., the number
of windshield wipers attached daily. Can the same be done for a faculty
member in an institution of higher education?

The three major sources of faculty evaluation are administrators,
faculty colleagues, and students, and the three major purposes of evaluation
are for making administrative decisions, faculty feedback, and student
use. Interestingly enough, a parallel exists between the sources and the
purposes. Does this imply that we can take advantage of the parallel and
have administrators evaluate faculty for administrative decisions, colleagues
evaluate faculty for feedback, and students evaluate faculty for student
use? The limitations of this approach should be obvious. The limited
perspectives of the three source groups defeat the purposes of the evaluations.
It is doubtful that in a "natural" environment each of the source groups
would have the same opportunities to observe the same characteristics and
behaviors. This is particularly true because the responsibilities of faculty
members are quite complex.

It can be safely stated that faculty responsibilities are equivalent
to, and encompass, the goals of higher education. At a very general level,
the goal of higher education is education. Without trying to get involved
in one of those ad nauseam discussions on the meaning of education, I would
like to present an over-generalized definition that should arouse little
disagreement. "Education" is defined here as a "process of change in some
desired direction, where this direction involves both short-term and
long-term objectives." Although recognizing education as a process, I would
like to also treat it as a goal. Education necessarily involves two components,
a learner and a stimulator, where the stimulator is a stimulus set external
to the learner, e.g., a teacher. The interest here is not in the separate
components, but rather in the interaction between the components.

Fortunately, the subject of this conference pertains to evaluation
of faculty effectiveness by students. Thus, concentrating solely upon
student evaluation delimits our problem somewhat, but not drastically so.
In terms of what students can observe and evaluate, a faculty member's
responsibilities include course instruction, modelling of an adult and
citizen, counseling, scholarship, and so on. I must admit that this
list covers broad areas, and is not very comprehensive. Nevertheless,
I am contending that those aspects of faculty effectiveness that can be
evaluated by students include more than the ability to babble nonsense
(which could be read in a textbook) for 10 to 20 hours per week to a group
of attentive listeners.

An extension of some old notions.

One of the common notions thrown around these days is that faculty
effectiveness can be measured on the basis of the so-called measures of
"learning": the achievement tests. Let's take a look at one of the bases
of this notion, an elementary school model, and try to determine whether
the generalization of such an approach to faculty evaluation in higher education
is feasible.

In the old, traditional sense of an elementary school education, in
which the desired goal is a grasp of the rudiments, the 3 R's, evaluation of
teacher effectiveness is quite straightforward. To a great extent, the
short-term objectives in this educational model include the learning of the
3 R's as a basis for later learning. Assuming random assignment of students
to teachers, the fact that each child is subjected mainly to one teacher for
a full school year allows us to evaluate and compare teachers according to
changes in the scores of their students on annual, common achievement
examinations.
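The comparison described in this elementary school model amounts to averaging each teacher's pre-to-post score changes. A minimal sketch follows, assuming one (teacher, pre, post) record per student; the function name and the scores are invented for illustration and are not data from any study.

```python
from collections import defaultdict
from statistics import mean


def mean_gain_by_teacher(records):
    """records: iterable of (teacher, pre_score, post_score), one per student."""
    gains = defaultdict(list)
    for teacher, pre, post in records:
        gains[teacher].append(post - pre)
    # Average change on the common annual examination, per teacher.
    return {teacher: mean(changes) for teacher, changes in gains.items()}


# Invented scores for two teachers with two students each.
records = [
    ("Teacher A", 40, 55),
    ("Teacher A", 50, 61),
    ("Teacher B", 45, 50),
    ("Teacher B", 52, 60),
]
gains = mean_gain_by_teacher(records)
```

The comparison is only as fair as the random assignment of students to teachers that the model assumes.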

Now, would such an evaluation model fit within the context of higher
education? To answer this question, I propose here to construct the HEI,
the Higher Education Instrument, through the combined efforts of the world's
great and not-so-great substantive experts and psychometricians. The HEI
will consist of many subtests, one for every short-term objective (i.e.,
every conceivable course ever taken in higher education) and one for every
long-term objective (e.g., problem-solving, intellectual independence,
enlightened citizenship, emotional maturity, etc.). The HEI could be
administered during the summer prior to entering a degree program and directly
after completing the other program requirements. I suspect that we would also
have to set aside weekends during these summers in order to motivate the
testees through the administration of pep talks and weekly supplies of pep
pills. From the HEI results, every teacher could be evaluated on the basis
of pre-post difference scores on the subtests of the students that he taught.

The flaws in such an undertaking should be obvious: the HEI is an
absurd caricature. Proponents of such an approach, on a somewhat less grandiose
scale, would have us believe that this is the only valid approach. They
would argue that a scaled-down HEI would allow us to attribute "learning"
of the students to the teachers. But, if the purpose of such testing is
the evaluation of faculty effectiveness, then I would argue that there are
more efficient methods to achieve this purpose.

An hypothetical model of education.

Let's assume a hypothetical multi-dimensional space. The axes of this
space have labels corresponding to the objectives of education such as Knowledge,
Understanding, Problem-Solving, Intellectual Independence, Emotional Maturity,
etc. Somewhere in this space, we have two points, one designated Learner
and the other designated Education, where both points can be located by distances
from the origin along the various axes. The exact location of the point
Education is not a matter of fact, but a matter of decision on the part of
the institution that defines education in terms of some proportional
balance of objectives.

I am tentatively positing a generalized model of Education. According
to this model, I am contending that an effective faculty member is one who,
in comparison to some standard, by appropriate stimulation propels Learners
closer to the point Education. Learning is the distance moved by a Learner
directly toward Education. It appears, then, that all we must do to evaluate
a faculty member is to measure the Learning of students stimulated by him.
Herein lies our problem. The model and the points are hypothetical, the
objectives are hypothetical constructs, and distances propelled along the
axes corresponding to the various objectives are not directly measurable.
Since we recognize that we cannot directly measure either the distances
propelled along the axes or the distance moved directly toward the point
Education, then perhaps we can measure other quantities that are estimates
of (and correlated with) the distances propelled. But, how can we determine
that we are accomplishing this in our measures? Or, rather, how can we
validate our measures?
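If the coordinates in this hypothetical space were known, which is exactly what the argument says they are not, Learning could be computed directly. The sketch below reads "distance moved directly toward Education" as the component of the Learner's displacement along the straight line from the Learner's starting point to Education. That reading, the function name, and the two-axis example are assumptions made for illustration.

```python
import math


def learning_distance(before, after, education):
    """Component of the Learner's displacement that points at Education.

    All three arguments are coordinate sequences over the same axes
    (e.g., Knowledge, Understanding). A negative result means the Learner
    moved away from Education.
    """
    to_goal = [e - b for e, b in zip(education, before)]
    moved = [a - b for a, b in zip(after, before)]
    goal_length = math.sqrt(sum(g * g for g in to_goal))
    if goal_length == 0:
        return 0.0  # the Learner already sits on the point Education
    return sum(m * g for m, g in zip(moved, to_goal)) / goal_length


# Two invented axes; Education lies 10 units out along the first axis only.
progress = learning_distance(before=(0, 0), after=(3, 4), education=(10, 0))
```

In this example the Learner moves 5 units in all, but only 3 of them count as Learning, because the remaining movement is along an axis that this institution's definition of Education does not weight.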

Construct validation and measures of faculty effectiveness.

Construct validation, as espoused by Cronbach and Meehl (1955), arose
as a method for validating measures in situations in which the classical
approach to criterion-oriented validation is inappropriate. The logic of
criterion-oriented validation generally involves the computation of a
correlation coefficient between the scores of a given test, be it personality,
attitudinal, interest, achievement, or whatever, and scores on some criterion
measure. The distinction between the measure derived from the given test
and the criterion is simply a matter of cost: money, time, subject cooperation,
etc. The criterion is more costly to measure directly, and it is more
expedient to have some estimate of the criterion that can be more easily
measured. However, when the criterion itself cannot be measured directly,
and no single correlation coefficient can be calculated as an estimate of
the validity of some given measure, the classical theory of validity becomes
inadequate.
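For the classical case, where the criterion can be measured, the validity coefficient is simply a product-moment correlation between test scores and criterion scores. A minimal sketch follows; the function name and the sample scores are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores: an inexpensive test and a costly criterion measure.
test_scores = [12, 15, 9, 20, 17]
criterion_scores = [30, 34, 25, 41, 36]
validity_coefficient = pearson_r(test_scores, criterion_scores)
```

When the criterion is itself a hypothetical construct, there is no second list of scores to pass in, which is the inadequacy the paragraph describes.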

According to the logic of construct validation methodology, there are
hypothetical constructs that are not directly measurable and constructs
that are directly measurable. In addition, there exists a nomological
net that consists of a set of "laws" interrelating all of the constructs.
After the nomological net, or model, has been hypothesized, research is
used to assess the relationships specified by the model, as well as to
suggest changes in the model on the basis of empirical evidence.

An approach that can be used to represent the construct validity method
was proposed by Campbell and Fiske (1959), the multitrait-multimethod
matrix. This approach makes use of two types of validity: convergent and
discriminant. Whereas convergent validity requires that a measure of a
particular construct be highly correlated with other, independently obtained
measures of that construct, discriminant validity requires that the measure
of that particular construct have lower correlations with measures of other
constructs.

Our interests concern the distance moved by the Learner toward the point
Education (and the distances moved along the axes toward the various objectives),
as stimulated by the Teacher. Clearly, we are dealing here with hypothetical
constructs for which we would like to have accurate measures. Let's imagine
that we constructed a measure of the hypothetical Learning distance, and we
call this measure Faculty Effectiveness. In terms of the Campbell and Fiske
approach, our measure Faculty Effectiveness has convergent validity if it is
highly correlated with other measures of the hypothetical Learning distance,
and our measure Faculty Effectiveness has discriminant validity if its
correlations with measures of other hypothetical constructs are quite a
bit lower.

A simplified example should help to clarify both aspects of construct
validity through the multitrait-multimethod matrix. I could have theatrical
directors sit in with students in several classes for a semester, and have
both groups of observers rate teachers in terms of Faculty Effectiveness
and Acting Potential. I could then calculate group means and the 6 correlations
between measures across group means of the different classes. The interest
of this little study is to help validate the Faculty Effectiveness measure,
and the results I would not mind obtaining are the following. I would like
my highest correlations to be between students' and directors' Faculty
Effectiveness measures (and between students' and directors' Acting Potential
measures). I would also like to find the other correlations substantially
lower. If, in fact, I found that my highest correlations were between
students' Faculty Effectiveness and Acting Potential measures, I might have
to conclude that unless I could find some way of conceptualizing acting as
a measure of the hypothetical construct Learning, my Faculty Effectiveness
measure is doomed and back to the drawing board I would go.
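The little study above can be sketched as a computation over four measures (two traits by two observer groups), which yields the 6 pairwise correlations. The class means below are invented, and the decision rule used here (every same-trait correlation must exceed the absolute value of every other correlation) is one simple reading of "substantially lower", not a rule taken from Campbell and Fiske.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented class means for five classes, keyed by (trait, observer group).
measures = {
    ("Faculty Effectiveness", "students"): [3.0, 4.0, 5.0, 6.0, 5.5],
    ("Faculty Effectiveness", "directors"): [2.8, 4.2, 4.9, 6.1, 5.2],
    ("Acting Potential", "students"): [5.0, 3.0, 6.0, 4.0, 2.0],
    ("Acting Potential", "directors"): [4.8, 3.3, 5.7, 4.2, 2.4],
}


def split_correlations(measures):
    """Separate same-trait (convergent) correlations from all the others."""
    convergent, other = {}, {}
    for a, b in combinations(measures, 2):
        r = pearson_r(measures[a], measures[b])
        target = convergent if a[0] == b[0] else other
        target[(a, b)] = r
    return convergent, other


convergent, other = split_correlations(measures)
supports_validity = min(convergent.values()) > max(abs(r) for r in other.values())
```

Had the students' Faculty Effectiveness and Acting Potential means been the most highly correlated pair, `supports_validity` would come out false, which is the back-to-the-drawing-board outcome.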

The point that I want to make here is that tools exist for the validation
of instruments and their items. Admittedly, such tools lead to the establishment
of long-term research programs, but until many of the constructs, both
hypothetical and measurable, can be specified in terms of their
interrelationships, there cannot be a satisfactory instrument for student
evaluation of faculty effectiveness. Simply stated, the "ideal" instrument
consists of measures that are valid with respect to the construct "Learning
as stimulated by the Teacher."

Proponents of the HEI approach insist that when measures of student
evaluation of teaching either fail to show significant relationships with
achievement test scores or show significant relationships in the opposite
direction (i.e., students assigning high ratings to the teacher are those
who received low scores on achievement tests), this invalidates the use
of student ratings. But, when neither measure has been validated against
the hypothetical Learning construct, such results really show nothing.
All too often, the means are confused with the end-products, and associated
with this erroneous reasoning is the belief that an achievement test score
is itself the hypothetical construct Learning. A positive feature of the
achievement test approach is that it may well lead to reasonable estimates
of a learning construct, without suffering too severely from response biases
that are so typical of rating instruments. But, surely, the objectives
defining Education are not likely to all be Knowledge-related. Furthermore,
the standardization of evaluation procedures brought about by student ratings
is a desirable feature. A single rating instrument, with items of proven
validity, can be more conveniently administered than achievement tests and
would allow comparisons across courses, departments, or colleges.

The question of the student's ability to evaluate faculty is often
raised: Who are students to judge? This is a fair question because on one
hand we are asking the student to go through the process of education, and on
the other hand we are asking the student to objectively judge either his
own educational progress or the characteristics of his teachers that lead
to his educational progress. The answer is simple. If the characteristics
of a good teacher can be defined and validated with respect to the construct
Learning, and are observable and accurately rateable, then student evaluation
can be a practicable solution for the various purposes of evaluating faculty
effectiveness.

Some principles in constructing rating scales.

The preceding discussion of construct validity and its relationship
to student evaluation of faculty effectiveness presupposed both the construction
of the measures and satisfactorily high reliability of the measures. I will
briefly review the logic and considerations involved in the construction
of rating items, and this will be followed by a rather brief note on reliability.
I am avoiding any mention of open-ended questions, since the major use I see
for this type of question is for faculty feedback and self-diagnosis.

The most common technique used in faculty evaluation instruments involves
rating scales. Although not necessarily in terms of item format, but in terms
of purpose, an important distinction exists between rating scales and attitude
scales. The purpose behind a rating scale is to objectively describe some
external object, whereas the purpose behind an attitude scale is to subjectively
describe one's reactions to, or attitude toward, that external object. With
regard to faculty evaluation instruments, this distinction is sometimes
clouded. I do think that we should be more interested in rating the teacher
than in measuring students' attitudes toward that teacher. The reasoning
behind this is that objective ratings of behaviors and characteristics of
the teacher should have a smaller, more "controllable" set of biases than
students' subjective attitudes toward that teacher.

For the most part, two types of rating items have been used in faculty
evaluation instruments: numeric ratings and graphic ratings. Both item
types involve a stem, which is a statement regarding a characteristic of the
teacher, and a series of cues, which are ordered adjective and/or adverb
phrases or words. For the numeric rating item, numbers are frequently
associated with the cues, and the respondent is asked to mark one response
on the questionnaire or to record the number associated with the cue on a
scorable answer sheet or punch card. For the graphic rating item, there
exists a response continuum, which is usually a segmented or continuous
line (more likely horizontal), with cues to identify regions along that line.
The respondent is asked to mark some point on that line. Although graphic
rating items are more easily administered, numeric rating items are more
easily scored.

According to Guilford (1954), the following are some guidelines for 
the construction of stems. The stems should describe traits, qualities, 
or behaviors that: 

(1) are objective and specific, 

(2) are not a composite of independent traits, qualities, or behaviors,

(3) refer to a single type of activity or its results, 

(4) are judged on the basis of present or past performance, not 
on future promise. 

In addition, 

(5) stems should not contain cues. 

Furthermore, according to Guilford (1954), the following are
guidelines for the construction of cues. Cues should:

(1) be short and unambiguous, 

(2) be consistent with the stem and other cues for that stem, 

(3) have a precise, short range, 

(4) have varied language with respect to a single stem, 

(5) avoid ethical, moral, or social evaluations, 

(6) not be similar across stems (i.e., non-common sets of cues). 

In constructing responses for a numeric rating item, there are additional
considerations regarding the number of responses in the scale and
characteristics of the cues. The number of responses should be such that the
respondent can discriminate the gradations between responses with the aid of
the cues. For statistical reasons (minimizing platykurtosis and skew), it is
preferable to have the cue anchors (the two most extreme cues) sufficiently
extreme that they will draw few responses; thus, if there are k possible
responses, there would be k-2 functional (most used) responses. The 5-point scale
item is fairly popular, and if the anchor responses were designed to be
used rarely, this would leave only 3 functional responses. In this case,
the amount of lost information from a functional 3-response scale depends
on the extent to which the respondent could have made finer discriminations.
In many of the faculty evaluation instruments that I have seen, there is
a built-in functional asymmetry insofar as the negative anchor cue is
quite extreme and has little drawing power, while the positive anchor cue
is not so extreme and has a stronger drawing power, thus skewing the item
response distributions.
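The two statistical properties named above, skew and platykurtosis, can be checked on the responses to a single item. The sketch below uses the ordinary moment-based definitions (third and fourth central moments scaled by the variance); the function name and the three response sets are invented for illustration.

```python
def skew_and_excess_kurtosis(responses):
    """Moment-based skewness and excess kurtosis of one item's responses.

    Negative excess kurtosis indicates a flat (platykurtic) distribution;
    negative skew indicates responses piled up at the high end of the scale.
    """
    n = len(responses)
    mean = sum(responses) / n
    m2 = sum((r - mean) ** 2 for r in responses) / n
    m3 = sum((r - mean) ** 3 for r in responses) / n
    m4 = sum((r - mean) ** 4 for r in responses) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Invented 5-point item responses.
flat = [1, 2, 3, 4, 5] * 4                        # anchors used as often as the middle
rare_anchors = [1] + [2] * 6 + [3] * 10 + [4] * 6 + [5]  # extreme cues rarely chosen
lopsided = [3, 4, 4, 5, 5, 5, 5]                  # weak positive anchor draws heavily

flat_skew, flat_kurtosis = skew_and_excess_kurtosis(flat)
rare_skew, rare_kurtosis = skew_and_excess_kurtosis(rare_anchors)
lopsided_skew, lopsided_kurtosis = skew_and_excess_kurtosis(lopsided)
```

In these invented data the item with rarely used anchors is less platykurtic than the flat item, and the item whose positive anchor draws heavily is negatively skewed.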

It may be desirable for the responses to be subjectively equidistant,
but this should not be done at the expense of truncating the range of
functional responses. If reasonably equal subjective response intervals are
desirable, it may be necessary to cue all of the responses, not just the anchor
responses. Another good reason for trying to cue all of the responses is that
the lack of cues may arouse ambiguity, which can lead to the operation of
response biases in the functional range of the scale. And, last, for
statistical reasons, the choice of cues should be dictated by efforts to have
the mean response across instructors centered in the middle of the scale.

The real bugaboo in rating scales is the operation of response biases
or errors. If care isn't taken in writing items and training raters, responses
may contain little more than the effects of bias. According to Guilford
(1954), there are at least 6 general categories of response bias: logical,
proximity, central tendency, leniency, contrast, and halo. Although all
of these response biases can be attributed to personal idiosyncrasies of
the raters, the first three categories can be considered non-interpersonal.
Logical bias is the tendency to give similar ratings on items that look
similar. Proximity bias is the tendency to give similar ratings on neighboring
items in the instrument. Central tendency bias is the tendency to give central
ratings rather than extreme ratings.

The remaining three categories of response bias may operate when other
people are being rated. Leniency bias is simply a characteristic of
people-as-raters: some people are just more lenient than others. Contrast bias is
the tendency to rate other people as being opposite from oneself. Halo
bias, perhaps the most serious of biases in inter-personal ratings, represents
a generalization of an overall subjective feeling toward the person being
rated to the rating of specific qualities of that person.

Of the many faculty evaluation instruments and individual faculty summaries
that I have seen, I think that the operation of the central tendency and contrast
biases is, if not minimal, far outweighed by the effects of the leniency,
logical, and halo biases. I think that most students are unwilling to be overly
critical of their teachers, and this may be due in part to their suspicions
about the anonymity of their responses. Further, the use of poor, relatively
global-type items seems to almost demand personal response bias rather than
objectivity. For this reason, I suspect that a good actor who assigns high
grades and stimulates little in the way of Learning can fare pretty well on
instruments consisting of items that violate most of the guidelines.

The operation of response biases is particularly problematic when it comes
time to assess the reliability of an instrument. By reliability, I mean
stability or consistency of measurement. The appeal of highly reliable
instruments and the efforts expended toward the development of reliable
instruments stem from the psychometric axiom that reliability sets a
ceiling on validity. The temptation of the researcher who is both aware
of the axiom and aware of the difficulties in assessing validity may be
the following: "Well, ... things can't be that bad if my reliabilities
are so high." The problem is simply that reliability is a necessary
condition, but not a sufficient condition, for validity.

Since the interpersonal response biases (leniency, contrast, and halo)
can be thought of as relatively enduring traits of the raters with respect
to the rating of a particular teacher, the variance attributable to these
response biases is included with the variance attributable to true scores,
thus exaggerating reliability estimates. Given a set of poor, ambiguous,
global items, with absolutely no validity with respect to Learning, I am
certain that I could provide you with reliability coefficients in the
.80's or even the .90's. Until it can be demonstrated that response bias
has been minimized or statistically controlled, we are wasting our time
calculating reliability coefficients.
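How a shared overall impression can inflate an internal-consistency estimate can be illustrated with a small computation. The sketch uses Cronbach's coefficient alpha, chosen here only as a familiar example since the talk does not name a particular coefficient, and the ratings are invented so that each rater's four item ratings hover around one overall feeling about the teacher.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a raters-by-items table of ratings."""
    k = len(rows[0])  # number of items
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented ratings: five raters, four global items. Each row stays close to
# that rater's overall impression, which is how halo bias would look.
halo_driven = [
    [5, 5, 4, 5],
    [2, 2, 2, 3],
    [4, 4, 4, 4],
    [3, 2, 3, 3],
    [5, 4, 5, 5],
]
alpha = cronbach_alpha(halo_driven)
```

The coefficient comes out above .90 even though nothing in the table says whether the items bear any relation to Learning.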

Some issues in item and instrument construction.

A great deal of latitude exists in the methods for constructing
faculty evaluation instruments. Considering this latitude, it is not
very surprising that different researchers achieve different, and sometimes
contrary, results in research relating student ratings of faculty
effectiveness to other measures. Until such time that the "ideal" instrument
is developed, some of the research differences will just have to be tolerated
and tentatively attributed to instrument differences or sample differences.
There are, however, a few issues relating to instrument construction that
are, if not resolvable, at least worth discussing at this time. The tack
that I shall take in the discussion of these issues is based on my conception
of the principles of common sense in conjunction with the psychometric
properties of good items. Discussion of these issues is critical if the
instruments we construct are to ever withstand the rigors of validity
testing. The following issues will be discussed briefly: the selection
of potentially valid items; behavioral vs. global items; the use of
composite scales; the use of common cues; the choice of response continuum;
the use of normative data; the rated object; and traditional vs. progressive
items.

The selection of potentially valid items is ideally done through a
model of learning in higher education. Since there aren't too many models
being kicked around these days, some other selection methods are needed.
Critical incident techniques and open-ended requests for traits seem to be
fairly successful methods for gathering items. The most popular method of
selecting items is the "prestige library" method, the borrowing of items
from popular, prestigious instruments. A very necessary, but often
overlooked, step for items that have been selected through means other than
a model is that of "ORA" consensual validation: observability by the target
source group; rateability by the target source group; and acceptability by
other source groups as measures that are potentially related to a Learning
construct. The ORA consensual validation should give a comfortable head start
on eventual construct validation.

The behavioral vs. global item issue concerns the complexity of the
behaviors rated. The following two stems are typical of the two extremes:
"The teacher made use of illustrations to get across difficult points"; and
"Overall, the teacher was..." For several reasons, mostly having to do
with the properties of global items, I believe the behavioral items to be
superior. First, global items are non-specific and somewhat ambiguous,
thus leading to the operation of response bias, particularly halo bias.
Second, in general, there is little reason to have great faith in the
reliability, and validity, of single items. Decision-makers (students
deciding which courses to take and administrators deciding who to promote),
when viewing the faculty evaluation results of individual teachers, tend
to search out one or two "comprehensive" items as a basis for their decision.
The fact that these one or two "comprehensive" items are global items, and
are overly subject to response bias, suggests that decisions regarding a
professor's performance are more likely based on the extent to which he is
liked, not necessarily the extent to which he is a good teacher. Third,
global items have little diagnostic value, and fourth, behavioral items
fall within the realm of the objectively observable.

Besides the diagnostic value and the better capabilities of minimizing 
response bias, an additional positive feature of behavioral items involves
their potential use in composite scales. On the basis of factor analytic, 
clustering, or even rational, techniques, various groupings of items can be 
summated to create composite scales. Assuming that the behavioral items 
are good items, the inherent advantages of scale scores include high 
reliability (and potential validity), as well as comprehensiveness. 

The use of common cues with common scale directions has some interesting
ramifications. Considering that students may evaluate several faculty in any
given semester, a long set of items with cues unique to each item may lead
to boredom, fatigue, and eventually large doses of response bias. One
alternative is to use a short set of behavioral items with unique cues, but
the set of items itself could not be very comprehensive. Another alternative
is to use a short set of global items with unique cues. And still another
alternative is to use a large set of behavioral items, making use of common
cues with common scale directions to facilitate the administration of the
instrument. But, this might not be a good idea either, since the common
cues may lead to the non-use of cues by the raters and the operation of response
bias. Since none of the alternatives are satisfactory, we will have to await
methodological research considering these questions.

Those who dare fate by using common cues sometimes do so for reasons
of expediency, e.g., the restricted area on optical scan sheets sometimes
forces instrument constructors to use common cues if the stems, cues, and
response areas are to be on a single sheet. If common cues are to be used,
what are the appropriate continua underlying these cues? The following
examples of underlying cue continua also include my perceptions of their
limitations in terms of introducing response bias. The "agreement-disagreement"
continuum suggests the subjectivity of attitudes rather than the objectivity
of rating. Other continua, such as a "success" continuum, are highly
value-laden and may well lead to the same result. At first blush, the
"frequency" continuum appears to have some nice objective properties, but
since it may be difficult to fit every stem to ratings in terms of ranges
of frequency of occurrence, this too may lead to response bias, particularly
for stems that do not comfortably fit the cues. Various other continua,
such as a "characteristic-uncharacteristic" continuum, may turn out to be
ambiguous and, thus, ignored for cuing responses. I think that what we
have here is another open area for methodological research.

With regard to use of normative data, I fail to see any issue: normative
data is an absolute necessity. Since the numeric values of item ratings
are completely arbitrary, with no zero point, and since the assignment of
numeric ratings is very much influenced by the particular set of cues for
each item, the only meaning that can be attached to item mean scores or
frequencies or to composite scale mean scores derives from comparison with
some kind of standard, such as a normative group. In addition, flexibility
can be gained by using different types of normative data, such as college
norms, department norms, student class norms, or even individual faculty
norms.
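
To make the point concrete, here is a small sketch of how one instructor's mean rating might be read against two different normative groups. The percentile-rank and z-score conventions, and all of the numbers, are assumptions chosen for illustration; no particular norming procedure is prescribed above.

```python
from statistics import mean, pstdev

def standing(score, norm_scores):
    """Return (percentile rank, z-score) of `score` within a norm group.

    The percentile rank counts half of any ties, a common convention.
    """
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    percentile = 100 * (below + 0.5 * equal) / len(norm_scores)
    z = (score - mean(norm_scores)) / pstdev(norm_scores)
    return percentile, z

# Hypothetical mean item ratings for other instructors.
college_norms = [3.0, 3.5, 4.0, 4.5, 5.0]
department_norms = [4.0, 4.2, 4.4, 4.6, 4.8]

instructor_mean = 4.0
college_pct, college_z = standing(instructor_mean, college_norms)
department_pct, department_z = standing(instructor_mean, department_norms)
```

The same raw mean of 4.0 sits at the middle of the hypothetical college group but at the bottom of the hypothetical department group, which is the sense in which the raw number carries no meaning of its own.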

Student evaluation of faculty effectiveness has been made through the 
rating of three objects: the teacher, the course, or the students' own 
educational development. Ideally, effective faculty offer good courses in
which students learn. But if the ratings of the three objects do not jibe,
what does this mean? Hartley and Hogan (1972) factor analyzed teacher-course
description ratings adapted from McKeachie's form along with the students'
ratings of their own self-development. Hartley and Hogan's results revealed
factors that were defined either by teacher-course descriptions or by
self-development items, but generally not by both types of items. These results
raise an interesting issue. Although the self-development approach would
seem to be a good method for removing response biases with respect to the
teacher (and course), it may provide little more than a vehicle for the
operation of a completely different set of response biases, self-perception
response biases. Unquestionably, this is another one of those issues that 
is in need of clarification from research. 

The last issue I would like to tackle is that of traditional vs. 
progressive items. Items obtained by the "prestige library" method tend 
to be traditional items. By traditional items, I mean items that are 
generally appropriate for most varieties of teaching situations. Progressive
items are those that allow for differing, non-standard types of behaviors and
approaches that may be effective as teaching and learning techniques. Since
not all teachers experiment with teaching devices and techniques, with regard
to instrument construction, there should be some concern that progressive
teachers would not receive low ratings on items that represent traditional
behaviors which were replaced by that teacher. For example, the teacher who
found that students in his courses gave more creative responses on examinations
if he did not tell them how to study for the course would not fare well on the
following item used with "frequency" cues: "The teacher gave advice on how
to study for the course." A not-so-pleasant alternative approach to avoid
this problem would be the use of global items.

A jaundiced view of what people do

Dick Riley and I were curious about how people actually constructed 
and used instruments for faculty evaluation by students. Although some of
the requests were lost in the mail, we wrote to just under 3,000 American
institutions of higher education, requesting information about use, financial
support, sources of items, and methods, as well as copies of instruments,
exemplary individual summary sheets, and technical reports describing the
construction of the instruments. Our questionnaires were sent to the highest
ranking academic administrator who we thought would be concerned with student
evaluation of faculty. We have received around 900 responses; this
higher-than-expected return rate may have resulted from our promise to send
copies of our report to the returnees.

Our responses came from a variety of institutions with respect to 
type (university, 4-year college, 2-year college, technical school,
postgraduate), size, sex (single sex, coeducational), and control (private,
public). Around 500 instruments were sent. A somewhat smaller number of
exemplary individual teacher summaries came with them, and the number of technical
reports describing construction and research on the instruments was quite small.
Being optimists, we just assumed that someone forgot to send along the reports.
On the other hand, we'll never know whether the technical reports that we
failed to receive actually exist, unless we are willing to go through some
extensive follow-up procedures.

The large majority of the instruments that we received were used primarily 
for faculty feedback purposes, and to a lesser extent, for administrative 
decisions and student perusal. It was, however, mildly disturbing to learn
that in more than one-half of the cases, the individual faculty summaries
were seen by decision makers (administrators, department chairmen, and
students).

The modal, typical instrument contained between 11 and 30 items, 
largely derived from other instruments, some by the "prestige library" method
and others by the "not-so-prestige library" method. The instruments typically
contained professor items, course items, global items, and open-ended questions.
Student development items were used, but did not seem to have the popularity
of professor items. Most of the instruments were mimeographed, with responses
to be marked on the instrument itself. Norms were used in conjunction with
around 10% of the instruments.

With a few exceptions, my own undocumented, global rating of the 
instruments would be the negative anchor on a 5-point scale. Item stems
contained statements about many unobservable characteristics of faculty and
courses or characteristics that should not be evaluated by college students.
In addition, the combination of emotionally loaded stems and cues that are
suggestive of attitudinal or evaluative judgments seemed to ask for responses
that contain little more than bias. As far as I could see, rare is the
instrument that hasn't violated at least one of the principles of good item
writing. Perhaps, if the instruments had been researched early in their
development, a good 99% would not be around today.

Much of the blame for these conditions should be placed on the colleges 
and universities themselves. Although acknowledging the need for student
input in the decision-making, these institutions have certainly tolerated, but
not encouraged, student evaluation of faculty. The reasoning seems to be as
follows: If everyone can agree on the inferiority of the bulk of the available
instruments, then no one really has to take them seriously.

In conclusion, I have tried to show that the methodology and technology 
are available for the construction of instruments. Even though, by definition,
the "ideal" instrument may never be constructed, the process of striving for
this goal should lead to vast improvements over the status quo.

THE KANSAS STATE UNIVERSITY PROGRAM
FOR ASSESSING AND IMPROVING INSTRUCTIONAL EFFECTIVENESS

Donald P. Hoyt
Kansas State University

Although my purpose is to describe the teacher evaluation program in
operation at Kansas State University, a brief review of its history is
necessary both to set the climate and to provide a rationale. The program
had its inception in October, 1968, during my second month as Kansas
State's Director of Educational Research. I was not as naive politically
as that may sound; while I was fully aware that faculty evaluation would
inevitably become a central concern of our Office, I intended to spend
my first year or two on less controversial and threatening problems. 
I felt a good program required the trust of the faculty, and gaining that 
trust would take time. 

This assessment, while absolutely correct, became less persuasive as
a deterrent when, over the course of two weeks, contingents representing
faculty-student committees on instruction in three different colleges
sought my advice on developing their own devices for appraising instructional
effectiveness. Given the alternative of having multiple amateurish efforts
whose quality would be questionable and whose administrative procedures
would be chaotic, I concluded that the potential dangers would be less
if I made a serious (though premature) attempt at appraising teaching
effectiveness. With an interpersonal touch they don't teach in graduate
school, I successfully inveigled each committee into requesting that I design
a system that would meet the needs of all three. 

The first problem which had to be resolved concerned purpose. There
was universal agreement that "improving teaching effectiveness" should be
a major thrust. In addition, there was considerable sentiment among
students that the program should also produce results which would help
them select courses and instructors. Students also tended to be sympathetic
to my bias that results should be used when making decisions related to
salary, promotion, and tenure. A violent faculty reaction to both of
these ideas made it clear that the only premise under which we could proceed
was that the sole purpose of the program would be the improvement of
teaching. I helped convince the student representatives to accept this
not only because it was worthy but also because I felt that, if the
program succeeded in this way, progress on the other purposes might become 
feasible in later years. As it turned out, this expectation was realized. 

A number of faculty members and students served as consultants. The 
faculty were particularly helpful. I began by presenting them with lists 
of items describing teaching behavior, stolen from various sources. I 
asked them to indicate which of the items were especially descriptive of 
good teaching. While most of the faculty consultants were courteous and 
made constructive comments, two or three of the most hostile ones had
the most positive effect on my thinking. One went to considerable trouble
to show how each of the items he was reviewing could be symptomatic of
inferior as well as superior teaching (e.g., "The teacher who lets students
discuss the fact that 2 plus 2 equals 4 wastes his time and that of his
students;" "Well-organized garbage still smells; and disorganized pearls
are still precious;" "Lovin' 'em don't learn 'em; the price of your
popularity is their ignorance."). Another astutely pointed out that any
attempt to describe the ideal teacher by a standard set of items was doomed
to failure because what was effective was dependent on the situation.
I felt a little foolish to have the obvious pointed out so clearly:
techniques that work well with large classes don't necessarily work with
small classes; and faculty members who are trying to get across solid
factual content may have to use methods quite different from those who 
are trying to stimulate students to examine their motives or values. 

These critiques led to the most important decision in designing our 
program and to the feature which most distinguishes it from others I have 
examined. I refer to the decision as to how teaching effectiveness 
should be defined. I could see no way to define it by describing any 
single role model. Rather, my most persuasive critics were saying, 
indirectly, that good teaching is recognized by its products. Examine 
what happened to the students and you'll know if the faculty member was
effective or not. 

When I asked my consultants to respond to that reasoning, I found 
no serious objections. What they did say was that there was no way to 
design a system based on this logic because the outcomes expected in 
each course would be different. Clearly, the effectiveness with which 
music appreciation was taught would require different measures of student 
outcomes than would be needed for a course in thermodynamics.

While I recognized the difficulties, earlier experiences convinced 
me that they may not be insurmountable. So a new tack was taken. Using 
the taxonomies (Bloom, 1956; Krathwohl, et al., 1956) and some stimulating
work by Deshpande and Webb (1968), I tried my hand at developing a list
of general objectives which could be used to describe the purpose of any
course. After several committee meetings and considerable debate, I was
left with a list of eight objectives which seemed to do reasonable justice
to the literature and to the suggestions of my consultants. The latter
agreed that, by supplying importance ratings to these objectives, faculty
members could provide a profile of their objectives which would adequately
describe their courses.

Now all we had to do was measure progress on these objectives. If 
we had reasonably adequate measures of student progress, we could combine 
them with the instructor's rating of importance to obtain an evaluation 
of teaching effectiveness which took into account the unique pattern of
objectives for each course. I recalled some earlier personal experiences 
in the development of empirical measuring devices which had been the source 
of some embarrassment. For example, after spending several thousand dollars 
of the Hill Family Foundation's money to measure anxiety, the most potent 
item our research uncovered was "I feel anxious about someone or something 
almost all the time." And in developing an alcoholism scale, our best
item was "I have used alcohol excessively in the past." These experiences
encouraged me to try the simplest, most direct approach; namely, to ask
the student how much progress he made. I had been involved with and knew
of a number of studies which suggested the value of self-ratings (e.g.,
Holland & Lutz, 1968; Keefer, 1965; Walsh, 1967). And Nate Gage provided
an inadvertent boost to my confidence when, a year earlier, he told a
seminar I was teaching that student ratings of their knowledge gain
correlated substantially with objectively measured gain in some of the
mini-unit studies he was conducting at Stanford. I finally found a
study by Soloman, Rosenberg, and Bezdek (1968) that reported findings which 
strongly supported my bias. This was enough intellectual armament to 
win approval from my consultants for a trial. 
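
The idea of combining student progress ratings with the instructor's importance ratings can be sketched as a weighted average. The paper does not give the actual formula, so the numeric weights (2 for "essential", 1 for "important", 0 for "minor"), the objective names, and the ratings below are all hypothetical.

```python
def weighted_progress(importance, progress):
    """Average the class progress ratings, weighting each objective by the
    instructor's importance rating for it.

    `importance` maps objective -> numeric weight (assumed coding);
    `progress` maps objective -> mean student progress rating.
    """
    total_weight = sum(importance.values())
    if total_weight == 0:
        raise ValueError("at least one objective must carry nonzero weight")
    weighted_sum = sum(importance[obj] * progress[obj] for obj in importance)
    return weighted_sum / total_weight

# Hypothetical course profile: 2 = essential, 1 = important, 0 = minor.
importance = {"factual knowledge": 2, "principles": 1, "creativity": 0}
# Hypothetical mean student self-ratings of progress on each objective.
progress = {"factual knowledge": 4.0, "principles": 3.0, "creativity": 1.5}

effectiveness = weighted_progress(importance, progress)
```

Under this weighting the low progress rating on an objective the instructor marked as minor does not pull the course's score down, which is the property the text asks of the combined measure.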

The rest of the technical history is quite routine. I devised an 
instrument which allowed students to rate their progress on the eight
objectives. It also contained a number of items, mostly plagiarized
(Isaacson, et al., 1964; Soloman, 1966; Whitlock, 1966), to describe the
instructor's behavior and the course. A separate form was used to collect
instructor ratings of the importance of each objective. After intensive
politicking, most faculty members "volunteered" to participate in the
developmental run which was conducted in the second semester of 1968-69. 

With results from well over 700 classes, there was plenty of data to 
divide the classes into "developmental" and "cross-validation" groups. 
Sixteen partially overlapping developmental subgroups were formed by 
sorting classes into two sizes (50 or more students and fewer than 30)
and into one or more groups based upon instructor objectives. For
example, one subgroup contained all classes enrolling fewer than 30
students where the instructor had rated the objective concerned with
gaining factual knowledge as "essential". Similar subgroups were formed
for large and small classes stressing each of the seven remaining objectives.
Classes within a subgroup were then assigned to one of six "progress"
categories on the basis of the average rating of student progress on the
objective in question. Then, statistical analyses were performed to
determine how descriptions of instructors whose classes made "much progress"
differed from those where student progress ratings were low. Resulting
scales were then cross-validated. 
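
The sorting described above can be sketched roughly as follows. The random halving into developmental and cross-validation sets, the record layout, and the treatment of classes with 30 to 49 students (left out, following the size bands as printed) are assumptions made for illustration; the text does not say how the split was actually drawn.

```python
import random
from collections import defaultdict

def size_group(enrollment):
    """Classify a class by the two size bands given in the text."""
    if enrollment >= 50:
        return "large"
    if enrollment < 30:
        return "small"
    return None  # 30-49 students: in neither band as the text is printed

def split_sample(class_ids, seed=0):
    """Randomly halve the classes into developmental and cross-validation sets."""
    ids = sorted(class_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def build_subgroups(classes):
    """Map (size, objective) -> class ids; a class may land in several subgroups."""
    subgroups = defaultdict(list)
    for c in classes:
        size = size_group(c["enrollment"])
        if size is None:
            continue
        for objective in c["essential_objectives"]:
            subgroups[(size, objective)].append(c["id"])
    return dict(subgroups)

# Hypothetical classes; objectives are numbered 1-8.
classes = [
    {"id": "A", "enrollment": 22, "essential_objectives": [1, 3]},
    {"id": "B", "enrollment": 75, "essential_objectives": [1]},
    {"id": "C", "enrollment": 40, "essential_objectives": [2]},
    {"id": "D", "enrollment": 18, "essential_objectives": [3]},
]
subgroups = build_subgroups(classes)
developmental, cross_validation = split_sample([c["id"] for c in classes])
```

With two size bands and eight objectives there are at most sixteen keys, and because an instructor may mark more than one objective as essential, the subgroups overlap in the way the text describes.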

In the end, a few items were found to be characteristic of effective
teaching regardless of the objectives being sought or the size of the class.
A few others were predictive of progress ratings in small classes but were
unrelated to progress in large classes; the reverse was also true. And
a few items didn't differentiate among progress groups on any criterion.
But for the most part, the contention of my early critics was substantiated.
The particular teaching behaviors which were related to student progress
were different for each objective and for large and small classes. Rather 
than one model of teaching effectiveness, we had developed sixteen. 

Other statistics were also encouraging. In cross-validation studies, 
the 16 special scales correlated from .50 to .83 with class progress ratings
with average values of .68 and .62 for large and small classes, respectively.
Reliability figures were generally over .90 for classes of 15 or more
students. Costs averaged about $3.00 per class, and these were covered by 
the U. S. Office of Education which had generously funded the effort (Hoyt, 
1969). 

We soon discovered that conducting a good study doesn't guarantee the 
implementation of a good program. A request for supplemental funds to
establish a service program based on our research was referred to the
Faculty Senate, which expressed the sentiment that the University had more
pressing needs. It was finally agreed that a faculty member who requested
the use of our device could be accommodated if the dean of his college
would pay the computing center costs. This procedure nearly resulted in
the stillbirth of the program; fortunately one dean not only provided 
blanket authorization for his faculty but made it clear that he believed 
volunteering to participate was a positive thing to do. This kept the 
program alive, but at a very reduced level. Results for approximately 
80 classes were processed in 1969-70. 

By happy accident, the Council of Academic Deans had voted a year 
earlier to establish a new office of faculty development. When it became 
apparent that enrollment increases would merit a number of new positions 
in 1970-71, a search committee was activated. As luck would have it, a 
popular member of our College of Education who had served as a teaching
consultant to many departments expressed an interest in the position.
His appointment in the fall of 1970-71 marked the end of vigorous resistance
to the program. As best I can interpret the situation, the changed
atmosphere was due partly to the one year cooling-off period, partly to
the idea that the teaching improvement program would include both appraisal
and expert consultation, and partly to the highly positive image of the 
faculty member appointed. 

In any event, through his office the program has been offered on a 
volunteer, confidential basis every semester since. The number of
participants has steadily grown from about 250 classes in the fall of 1970 to
over 400 last fall. While instructional improvement has remained the 
program's major thrust, over 90 percent of the participants last fall 
agreed to release selected parts of the report for publication by a Student 
Senate committee. And by recent action of the Faculty Senate, results 
from this or similar devices must be made available to the department head 
before reappointment decisions are made. In addition, the instrument plays 
a major role in selecting winners of the outstanding teacher cash awards. 
Three years after its traumatic birth, the program is thriving and finding
broad application on our campus. 

Teaching improvement is the major purpose of the program. In a recent 
study of changes of scores over one and two year periods, there is the 
suggestion that the program has enjoyed at least minimal success. Retest
scores for the same course-instructor combinations showed significant gains
on both student progress ratings and on a number of instructional methods
scores. 

Let me describe more specifically how the evaluation is used to improve
instruction. You will recall that the basic research effort resulted in
identifying 16 lists of items describing the classroom behaviors characteristic
of the most successful teachers (i.e., those whose students reported the
most progress on a given objective). These lists of relevant items have
become the focal point of our efforts to improve instructional procedures. 

Typically, the process goes like this. A computer report and an
interpretative manual are sent to the participating faculty member. He reacts
with confusion, disappointment, or curiosity and accepts our invitation to
attend a group interpretation meeting. There he is asked to identify the
areas of greatest concern by comparing his importance ratings with student
progress ratings; appropriate norms are used and most instructors attending
these sessions find at least one important objective where student progress
ratings were below average. When such an objective has been identified,
the faculty member is asked to review the particular teacher behavior items
which were positively related to gains on this objective. He is shown how
most teachers are rated on these items and his printout shows how he was
rated. Invariably, his rating will be unusually low on a few of these crucial
items. Presumably, these items will form the basis for his self-improvement
efforts. 

What happens then is not highly predictable. Some seem to resolve to
do better and let it go at that. Others may arrange to attend one of the 
special seminars on teaching procedures conducted by the Director of 
Educational Improvement. Still others make individual appointments with 
the Director and embark on individual improvement programs of various degrees 
of intensity. Figure 1 shows the "Before" and "After" results for one 
faculty member who embarked on a serious, time-consuming self-improvement
program under the supervision of the Director.

Although the program is alive and well, the instrument was thoroughly
revised in the fall of 1972-73. The revision reflects both the criticisms
of experienced users and re-thinking on our part of how our improvement 
purposes could be best served. Let me give you an indication of the changes 
we felt were required without boring you with details. 

1. The list of objectives was revised and expanded. As users became 
more acquainted with the key role objectives play in our evaluation process, 
they became more articulate about their purposes and about inadequacies in 
our original list of eight. The revised form includes 10 objectives.

2. While student progress on relevant objectives continues to be 
our criterion of success, the revision more explicitly recognizes that 
such progress may be a function of the students as well as the teacher.
Therefore, a number of items relating to student motivation and expectations
have been included and will be examined for their relevance to student
progress. These items may help us adjust for the advantage which courses
enrolling motivated majors have always had over general education courses. 

3. Altering classroom behaviors is one way to induce more student
progress; another may be to plan the course more wisely. A set of items
on the revision is directed to the latter strategy by inquiring about
course demands, content, and reading assignments. 

4. To satisfy faculty members and students alike, we dropped the
"true-false" format in favor of five-response alternatives throughout.
In the course of doing this, non-functioning items (those unrelated to
any kind of progress ratings) were eliminated. 

5. By using a new input procedure, we have reduced processing costs 
to an average of $1.50 per class plus 1 cent per student.
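
As a rough illustration of the cost figure in item 5, the sketch below treats the stated average as a simple per-class formula and compares it with the earlier average of about $3.00 per class. Reading the average this way is an assumption, and the function name and class sizes are hypothetical. Amounts are kept in whole cents to avoid rounding surprises.

```python
OLD_COST_CENTS = 300  # earlier average of about $3.00 per class

def processing_cost_cents(num_students, per_class_cents=150, per_student_cents=1):
    """Cost of processing one class under the revised input procedure, in cents."""
    return per_class_cents + per_student_cents * num_students

cost_small = processing_cost_cents(40)    # a 40-student class
cost_large = processing_cost_cents(150)   # a 150-student class
```

On this reading, a 40-student class would cost $1.90 to process, and the revised procedure would remain cheaper than the old average for any class of fewer than 150 students.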

We're confident the changes will make the program more valuable to 
its users.

Our experience suggests that several ingredients are needed to develop
a successful program for appraising and improving instructional performance.
A sound rationale which can respond meaningfully to well-intentioned 
concerns and objections of faculty members is essential. The rationale needs 
solid statistical and research support. Delicate political problems must 
be faced and resolved with sensitivity, patience, and a willingness to
compromise. A smoothly functioning administrative process is essential so 
that needed materials show up at the proper time and place, no results get 
lost, reports are made in a reasonable length of time, and continuity in 
service is assured. Finally, it is necessary to demonstrate sincerity of 
motive by providing assistance in interpreting diagnostic reports and 
responding constructively to the shortcomings they identify.

STUDENT EVALUATION OF INSTRUCTION AT MICHIGAN STATE UNIVERSITY

Willard G. Warrington 
Michigan State University 

It is my intention to follow the case study approach in my presentation,
a sort of "show and tell." I want to discuss in some detail the Student 
Instructional Rating System (SIRS) that has been in operation on the 
campus of Michigan State University since 1969. The rating form for this 
system was empirically developed over a two year period, was accepted as a 
part of the academic program by the faculty and has now been administered 
in over 10,000 classes to more than 400,000 students. 

It is not my intention to argue that our SIRS is the best system in 
operation anywhere or even that it is an outstanding system but rather to
report some of its characteristics and some of the consequences of its 
widespread use during the past three years. 

First, let me put SIRS into some historical perspective. MSU, like 
many institutions, has for many years encouraged its faculty members to
utilize student feedback in analyzing and evaluating classroom instruction. 
A series of locally developed rating forms were made available but these 
varied considerably in quality and their use, at best, was relatively 
infrequent and unsystematic. 

Consequently, in 1967, a specific project was funded under the MSU 
Educational Development Program (EDP) to undertake the systematic
development of a comprehensive student instructional evaluation system which would
provide faculty members with student reactions to their teaching. This
project was under the direction of Dr. F. Craig Johnson, Assistant Director
of EDP (now a professor of Institutional Research at Florida State University),
with most of the actual development work being carried out by two doctoral
students in Psychology, Wallace Berger and Stanley Cohen.

In view of certain developments which I will discuss later in this
paper, it is important to remember that the initial objective of the SIRS 
project was to develop procedures for allowing an instructor to collect 
student feedback data that he could utilize for self-examination and self- 
improvement of his instruction. 

In the early stages of this project two important decisions were 
made that had much to do with the final characteristics of SIRS. First, 
it was decided to heavily involve both students and faculty in the
determination of the content of the rating form that was to be developed and, second,
it was agreed that the completed system would provide normative data so 
that faculty members could determine their standing relative to other 
faculty teaching similar courses. 

No effort will be made to describe in complete detail the actual 
steps in the two years of the development of SIRS. (For those interested, 
a fifty-page technical bulletin is available through the MSU Office of
Evaluation Services.) However, some broad overview is necessary to 
understand the system that finally emerged.

Briefly, the SIRS project proceeded as follows: Students and faculty 
in a wide range of courses were interviewed in the summer of 1967 using the
"critical incident" approach. Faculty were asked to "compare and contrast
your best and worst students." Students were asked to "compare and contrast
your best and worst instructors." All interviews were content analyzed,
resulting in 1300 key phrases and sentences which were rewritten in an
item format suitable for an evaluation form. Items from existing student
instructional rating forms were also collected. After much editing and
elimination of duplication, 223 experimental items were grouped into six
parallel forms for administration during the 1967 Fall Term.

A rather elaborate stratified sampling procedure identified 1,286 
students and 594 faculty members who were asked to react to one of the 
parallel forms by responding to four questions about each item. These
questions represented evaluative dimensions which emerged from the initial
interviewing. The faculty was asked: 

1. Does this item present information which you could use for course 
improvement? (yes/no) 

2. If you were to construct a student course appraisal sheet would 
you include this item? (yes/no) 

3. Would you need additional information to interpret the responses 
to this item? (yes/no) 

4. Do you believe that students have enough information and/or are
competent to accurately respond to this item? (yes/no) 

Students answered the following questions for each item:

1. Do you believe this item is relevant for appraising this course?
(yes/no) 

2. If you were to construct a student course appraisal sheet would 
you include this item? (yes/no) 

3. Would you want to qualify your response to this item? (yes/no) 

4. Do you believe that you have enough information and/or are
competent to evaluate those aspects of the course referred to by 
this item? (yes/no) 

Of the 1,286 questionnaires mailed to students, 611 returns were usable 
for a return rate of 48%. Of the 594 faculty questionnaires mailed out, 
265 of the returns were usable, for a return rate of 45%.

The data were analyzed in the following manner: The proportion of the
students indicating that an item was (1) relevant for evaluation of their 
course, (2) potentially useful in terms of course improvement, (3) not 
needing student qualification, and (4) capable of meaningful student 
evaluation, was computed. The same was done for faculty responses. The 
students had also been asked to evaluate their course through the use of
the experimental items. These responses were used in order to determine 
the response distribution on the items. 

The intercorrelations among the four questions were computed separately
for faculty and students. The two (faculty and student) intercorrelation
matrices showed some similarity and some striking differences. For the
faculty the correlation between whether an item could be used for course
improvement and whether the item should be included in a course evaluation
form was .95. For students the correlation between whether an item was
relevant for course evaluation and whether it should be included in the
evaluation form was .96. Even though there was a strong relationship
between the inclusion of an item in an evaluation form and its usefulness
or relevance for both faculty and students, there was not nearly as high
agreement (r = .68) between faculty and students as to whether an item
should be included. This undoubtedly accounts for some of the disagreements
between faculty and students as to what should be on an instructional rating
scale.

Furthermore, for the faculty, negative correlations were found between
question 3 ("Would you need additional information to interpret the
response to this item?") and the other three questions. For the students,
question 3 ("Would you want to qualify your response to this item?")
yielded no negative correlations with the other three questions. Thus,
it seemed that faculty wanted additional information with respect to
many items but students felt that a categorical yes-no answer was sufficient.
And another source of difficulty may result from the fact that the
correlation between whether the faculty believe that the students have enough
information to answer the item and whether the students believe that they
have enough information was relatively low, only .54. Very likely, this
difference of opinion between students and faculty decreases the effectiveness
of student involvement in educational decision-making, in general.

As the study proceeded it was agreed that a subset of the original 223 
items would be selected for a second experimental SIRS form by equally 
weighting faculty and student opinion in the following manner. 

An item was selected for inclusion in the next experimental form if: 

1. At least 70 percent of the students and at least 70 percent of
the faculty indicated that the item (a) could be used for
course improvement (is relevant for course appraisal), (b) should
be included in an evaluation form, and (c) could be competently
evaluated by students.

2. It had a pooled student and faculty average higher than 80
percent on the above three variables.

If any of the items which fulfilled these two criteria were rated by 40 
percent or more of the students or the faculty as needing qualification, 
the items were rewritten and then included in the form. 

Thus, only those items which were judged as being relevant, warranting
inclusion in an evaluation form, and capable of meaningful student evaluation
by 80% of the combined student and faculty sample were designated as pilot
items.
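
The selection rule described above can be sketched in a few lines of
Python. This is only an illustration and not the procedure actually run
at the time: the class and function names are hypothetical, the
proportions are assumed to be fractions between 0 and 1, and the pooled
80 percent criterion is assumed to apply to each of the three variables
separately with students and faculty weighted equally.

```python
from dataclasses import dataclass


@dataclass
class Endorsement:
    """Proportion (0.0 to 1.0) of one group answering 'yes' to each question."""
    useful: float     # usable for course improvement / relevant for appraisal
    include: float    # should be included in an evaluation form
    competent: float  # students can competently evaluate it
    qualify: float    # needs more information / response needs qualification


def classify_item(students: Endorsement, faculty: Endorsement) -> str:
    """Return 'reject', 'include', or 'rewrite' for one experimental item."""
    for name in ("useful", "include", "competent"):
        s, f = getattr(students, name), getattr(faculty, name)
        # Criterion 1: at least 70 percent of each group.
        if s < 0.70 or f < 0.70:
            return "reject"
        # Criterion 2 (assumed to hold per variable): the equally weighted
        # pooled average must be higher than 80 percent.
        if (s + f) / 2 <= 0.80:
            return "reject"
    # Items passing both criteria but flagged as needing qualification by
    # 40 percent or more of either group were rewritten, then included.
    if students.qualify >= 0.40 or faculty.qualify >= 0.40:
        return "rewrite"
    return "include"
```

An item endorsed by 90, 86 and 88 percent of students and by 82, 80 and
76 percent of faculty would pass both criteria under these assumptions.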

Through the use of the above procedures 56 items were selected for the
first pilot form. These items were divided into six categories, each
category preceded by a title which appeared to characterize the general
topic covered. Eleven biographical items (class level, age, course
required, sex, course recommended, marital status, number of credits earned,
preconceptions of this course, G.P.A., number of other courses in the same
department, and grade up to now) were also included in the questionnaire
in order to assess the relationship, if any, among these variables and the
course evaluation items.

This pilot form was administered in the winter of 1968 to 2,841
students in large introductory level courses taught by 36 different instructors.
Various types of analyses were performed on the resulting data including
a Varimax factor analysis. Five factors were identified and interpreted
as follows:

Factor 1. Consisted of eight items and appeared to be related to
instructor characteristics such as instructor involvement and attitude
towards teaching. (INSTRUCTOR INVOLVEMENT)

Factor 2. Consisted of seven items and appeared to be related to the
students' interest in the course and the students' performance in the course.
(STUDENT INTEREST AND PERFORMANCE)

Factor 3. Consisted of six items and appeared to be related to
student-instructor interaction in terms of personal communication between
students and faculty members. (STUDENT-INSTRUCTOR INTERACTION)

Factor 4. Consisted of five items and appeared to be related to the 
difficulty and speed at which the course material was presented. (COURSE 
DEMANDS) 

Factor 5. Consisted of seven items and appeared to be related to the
organization of course materials and lecture presentations. (COURSE
ORGANIZATION)

From data collected thus far a 20-item scale consisting of four items
for each of the five factors was tested in the summer of 1968 to (1) determine
the stability of the factor structure under both a two-choice and a
five-choice item format and (2) pre-test a machine-scored form. Approximately
half of the 1,200 students tested received the two-choice form and the
remainder the five-choice form. These data indicated that the factor
structure was stable and that the five-choice format had superior operating
characteristics in addition to being more favorably received by both
students and faculty.

During the remainder of 1968, items pertaining to laboratory and
recitation sections were developed by a process similar to the one described
above. Also, a general purpose item, No. 21, "You generally enjoyed going
to class," was identified and included in the final form which now contained
21 instructional evaluation items, four student background items, and
three laboratory and recitation items.

Copies of the instrument, printed on Optical Scanning sheets as attached, 
were made available to the College of Agriculture, College of Engineering, 
College of Social Science and the University College in the spring of 1969. 
These four colleges administered and had scored 8,012 forms.

For a last comprehensive check on the item structure, correlations 
over the 21 instructional evaluation items for the 8,012 respondents were 
computed, factor analyzed, and subjected to a Varimax rotation. The 
structure remained stable and the pattern of item loadings was identical
to that of earlier studies. 

It, therefore, seemed reasonable to assume that feedback to the
instructor could consist not only of the mean responses on the 21 items
but also of the mean response of each of the five factors. Since the
loadings for each item were of a high magnitude, a good approximation to the
factor means could be obtained by simply averaging the means of the four
items most heavily weighted on each factor. The five averages would comprise
a composite profile of the instructor's evaluation on the five dimensions
of the learning situation. This format was later incorporated into the
SIRS Report. Internal consistencies (average inter-item correlations
corrected by the Spearman-Brown formula) were computed for each of the five
dimensions and were as follows:

Instructor Involvement - .81
Student Interest and Performance - .79
Student-Instructor Interaction - .84
Course Demands - .73
Course Organization - .83
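
As a rough illustration of the two computations just described, the
sketch below averages four item means per factor and applies the
Spearman-Brown step-up, k * r / (1 + (k - 1) * r), to an average
inter-item correlation r for k items. The function names are
hypothetical, and the item-to-factor key is an assumption inferred from
the order of the items on the printed form, not a documented scoring key.

```python
from statistics import mean
from typing import Dict, List

# Assumed item-to-factor key (inferred from item order on the form).
FACTOR_ITEMS: Dict[str, List[int]] = {
    "Instructor Involvement": [1, 2, 3, 4],
    "Student Interest and Performance": [5, 6, 7, 8],
    "Student-Instructor Interaction": [9, 10, 11, 12],
    "Course Demands": [13, 14, 15, 16],
    "Course Organization": [17, 18, 19, 20],
}


def composite_profile(item_means: Dict[int, float]) -> Dict[str, float]:
    """Approximate each factor mean by averaging its four item means."""
    return {
        factor: mean(item_means[i] for i in items)
        for factor, items in FACTOR_ITEMS.items()
    }


def spearman_brown(avg_inter_item_r: float, n_items: int = 4) -> float:
    """Step up an average inter-item correlation to an n-item reliability."""
    return n_items * avg_inter_item_r / (1 + (n_items - 1) * avg_inter_item_r)
```

Under this formula, a four-item reliability of about .81 corresponds to
an average inter-item correlation of roughly .52.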

After the analysis of the data from the spring 1969 administration,
the decision was made to consider the rating form finalized and to proceed
to develop the rest of the evaluation system. It is important to remember
that SIRS was never seen simply as a paper-and-pencil rating instrument
but rather as a system for collecting, analyzing, displaying and interpreting
student reactions to classroom instruction and course content in order to
improve the quality of that learning situation. The rating form obviously
related to the collecting aspect of the system. It should be noted that
the final form contained some blank spaces in which the instructor may
insert optional items of his own choosing. The student responses for these
items would be summarized in the SIRS Report which is discussed below.
In addition to this flexibility, the back of the SIRS Form is available
for more general comments or specific reactions to specific questions.

The SIRS Form was designed to be processed by an Opt Scan 100 Dm
Optical Scanner which produces an 800 cpi, 9-channel magnetic tape. This
tape is read into an IBM 370-155 (initially an IBM 360-60) computer which
analyzes the student responses and produces a print-out known as the SIRS
Report. Much trial and error went into the programming and format of this
Report to make it readable and understandable by faculty members with
little or no computer experience.

The Report lists each question from the SIRS Form along with the
percentages of students marking each of the five positions from Strongly Agree
to Strongly Disagree. The mean and standard deviation of responses for each
question are also shown. These are computed using a 5-point scale where
Strongly Agree is assigned a value of 1 and Strongly Disagree a value of 5.
Also for each question, percentile ranks are given indicating how the mean
for this particular administration compares with previous administrations
of SIRS in the same course, in all courses in that particular department
and in all courses in that particular college. In all cases the percentile
rank listed indicates the percent of previous administrations that resulted
in mean ratings that were less favorable than the present mean rating.
In other words a high percentile rank indicates that the mean rating for
a question in this particular administration is more favorable than most
mean ratings from previous administrations.
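
A minimal sketch of this percentile rank follows. It is not the SIRS
program itself; the function name and the lower_is_better flag are
hypothetical. It assumes that on the 1 (Strongly Agree) to 5 (Strongly
Disagree) scale a numerically lower mean is the more favorable one for
positively worded items, and it leaves the direction to the caller for
items worded the other way.

```python
from typing import Sequence


def percentile_rank(current_mean: float,
                    previous_means: Sequence[float],
                    lower_is_better: bool = True) -> float:
    """Percent of previous administrations whose mean rating was less
    favorable than the current mean rating."""
    if not previous_means:
        raise ValueError("no previous administrations to compare against")
    if lower_is_better:
        # 1 = Strongly Agree, so a larger mean is less favorable.
        less_favorable = sum(1 for m in previous_means if m > current_mean)
    else:
        less_favorable = sum(1 for m in previous_means if m < current_mean)
    return 100.0 * less_favorable / len(previous_means)
```

For example, a current mean of 1.8 compared against previous means of
1.5, 2.0, 2.2 and 2.5 is more favorable than three of the four, giving a
percentile rank of 75.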

Of course, this audience recognizes that this is a relative system of
comparison and that in any given situation half of the administrations will
result in mean ratings that will be above average and half below average,
regardless of the general level of instruction. Nevertheless, in our
opinion, it seems desirable to present this comparative data since otherwise
student reactions are very hard to interpret because they tend to be overly
positive. Such inflated ratings often present a misleading picture to the
instructor who receives mean ratings near or above the midpoint of the
scale where, in fact, his ratings may be quite low when compared with many
of his colleagues.

In addition to the data presented for each question the Report
also presents a Composite Profile in which summarized data is shown for
each of the five areas mentioned above, namely Instructor Involvement,
Student Interest, Student-Instructor Interaction, Course Demands, and
Course Organization. Here again means, standard deviations and percentile
ranks for the course, department and college are shown for each area.
Since these data are based upon the average over four questions for each
area, they tend to be somewhat more stable than data for individual
questions.

To assist users of SIRS in understanding and interpreting their
Report a SIRS Manual was developed to support the system. The Manual
summarizes the purposes and characteristics of SIRS, gives information as
to what these data mean and how they should be interpreted, lists some
precautions in using the Report and presents a variety of questions that
instructors may want to use as optional items when they administer the
SIRS Form. To date, several thousand of these Manuals have been
distributed and the overall reaction to the document has been quite favorable.

After the total system had been thoroughly reviewed and experimental
administrations had been given to another several thousand students in the
fall of 1969, SIRS was recommended for adoption on a mandatory basis for
the total University. After considerable debate, all generally constructive,
the University Academic Council, which is the highest faculty governance
entity, on December 2, 1969, passed the following resolution:

Use of the Student Instructional Rating Report

The use of the Student Instructional Rating
Report (SIRR) should be adopted with the full
realization that it is but one parameter of
instructional evaluation.

A. The regulations for the use of Student Instructional
Rating Reports in effect since January 20, 1949, will
be declared void on adoption of the new policy.

B. Each of the teaching faculty (including graduate
assistants) at Michigan State University regardless
of rank or tenure is required to use the Student
Instructional Rating Report to evaluate (1) at
least one course in every quarter in which he
teaches and (2) every separate course he teaches
at least once a year.

C. The results generated by the Instructional Rating
Report shall be evaluated at the departmental level
in order to help determine individual effectiveness.
Appropriate procedures for the execution of this
evaluation shall be determined according to
departmental or residential faculty prerogatives.

Two aspects of this action seemed rather interesting. First, the
resolution made the use of SIRS mandatory across the board. The requirement
that an instructor obtain student feedback pertaining to his instruction
no longer applied only to lower ranks and/or to relatively new faculty members
as had previously been true. But second, and even more drastic, the
resolution for the first time officially recognized that student reactions to
instructors are no longer the sole property of the particular faculty member,
but belong, in part, to that segment of the University involved in making
decisions with respect to the academic effectiveness of this faculty
member. I am still not completely convinced that the Academic Council
members were fully aware of the implications of the resolution they passed
and now, over three years later, it has not been seriously challenged.

I would like to discuss briefly one additional aspect, a very important
one in my opinion, of SIRS before reporting what has happened since the
system was adopted officially. Any system must be internally reinforcing
if it is to be self-improving. The components of SIRS described above will
be internally reinforcing to the extent that the Form is accepted as useful
and the Report is relevant and understandable. But there must be an
additional mechanism to make the system complete. To be specific, if an
instructor who is utilizing the system decides, from his results, that he
needs additional assistance, such assistance must be available. At MSU,
the Instructional Development Services in the Provost's Office is designed
to serve this function. The Instructional Development Service includes
three different supportive agencies: (1) The Learning Services assists
instructors in analyzing their instructional situations, in the development
of their objectives and in the structuring of actual learning experiences.
(2) The Instructional Media Center provides a full range of consultative
and supportive services in the audio-visual area, including closed circuit
television, and (3) The Office of Evaluation Services provides consultation
services and technical assistance in the area of classroom evaluation and
test construction and analysis. All three of these offices have well
qualified professionals who work in a face-to-face situation directly
with faculty members who are trying to better understand and improve the
learning that takes place in their classrooms. This is the segment of an
instructional evaluation system that is too often lacking. Granted, these
are relatively expensive operations but, in my opinion, they are vital if
the quality of instruction is to be improved through the utilization of
student evaluations of instruction.

Now back to a brief discussion of what has happened since December, 1969,
with respect to SIRS. First, the system is being used widely. As of the
middle of fall term, 1972, normative data was available for 9,326
administrations involving 318,654 student responses. Nearly 100,000 more
responses have been collected since then. SIRS administrations have been
processed for classes in every college and most teaching departments of the
University. However, it should be made clear that not all departments are
reacting the same to the December 2, 1969, Academic Council Resolution which
you recall did give the department considerable leeway as to how SIRS should
be utilized. Several departments are stressing the requirement that all
classes be evaluated and that the results must be available at the department
level. Others are allowing the individual instructor to decide whether or not
he submits his SIRS results to the department. And a few departments have
decided that the SIRS form is not appropriate and have developed an
instrument of their own. Several of these incorporate much of the SIRS
approach, others are completely different.

Comments with respect to SIRS have ranged from very supportive to very
critical. The most common criticism is that the Form is too "blah," i.e.,
it does not ask the important questions. This comes from both students
and faculty. But if you recall the method by which the questions were
selected this is not entirely unexpected since the original data indicated
considerable disagreement between faculty and students as to what was
important and what should be included on an appraisal form. Yet, only
those items upon which there was high agreement were included. We
recommend that instructors include those questions about which they feel
strongly as optional items on the SIRS Form or administer a complementary
form in addition to the SIRS Form.

Another area of concern has developed which is much more difficult to
cope with. Many faculty members are quite concerned with the lack of
uniformity as to how the SIRS forms are administered and used. These people
feel that if faculty personnel decisions are to be based, even in small
part, on SIRS results, then it is important that such data are collected
under the same standardized procedures. It is our belief that systems
for administering the SIRS forms should probably be the responsibility of
colleges or departments. A university-wide system would probably be
unwieldy and unresponsive to departmental needs.

Two other areas of concern are worth mentioning. There seems to be
some tendency on the part of some faculty and administrators to act as if
we have the problem of the evaluation of instruction solved. Many of us
have argued strongly but evidently not too effectively that student
evaluation of instruction is only one dimension of this overall evaluation
process. In our opinion, classroom visitation and observation by colleagues
and administrators can provide useful data. Similarly, some of us would like
to see more attention given to attempts to measure changes in student behavior
as relevant information for evaluating teaching effectiveness. However,
I do not want to minimize the importance of the student input but only
to argue for additional systematic input into the total process.

And finally we are genuinely concerned about improper or
over-interpretation of the data provided in the SIRS Report. We occasionally
hear where some instructor is called into question because his comparative
norms have dropped a couple percentage points. Or some department chairman
cannot understand why some people in his department are below average. Or some
instructor receives a mean rating considerably above the midpoint of the
scale yet receives a normative rating at the 30th percentile rank. We try
to answer these queries by phone or in person when they are brought to our
attention. But in an attempt to answer these and other unasked questions
we have to date prepared four SIRS Research Reports. These are:

1. Analysis of SIRS Responses for Winter Term, 1970 - Feb., 1971

2. Stability of Factor Structure of SIRS - Nov., 1971

3. Using SIRS Data in the Decision-Making Process - March, 1972

4. Student Instructional Rating System Responses and Student
Characteristics - May, 1972

These reports and other SIRS support materials are available from the
Office of Evaluation Services, 202 South Kedzie Hall, Michigan State
University, East Lansing, Michigan 48823.

Very briefly, Report #1 presents summary data from early SIRS
administrations for the total University (remember our norms are for course,
department and college). Data is also given for SIRS responses by level of
course, by reason for taking course, and by level of grade point average.

Report #2 presents the two independent SIRS factor structures that 
were produced as the instrument was developed. It is interesting to note 
that Dr. Raoul Arreola of Florida State University reported at the 1973 
meeting of NCME in New Orleans that he had factor analyzed the results of 
the MSU SIRS administrations at FSU and had gotten a factor structure very
similar to that which we had obtained. His data further support the rather
remarkable stability of the factors mentioned earlier in this paper.

Report #3 was designed to provide SIRS users, particularly those using 
SIRS data in personnel decision-making, with a more sophisticated explanation 
of the nature and limitations of SIRS data, especially the percentile norms. 
Precautions and illustrations of appropriate and inappropriate
interpretations were discussed in considerable detail. We have some evidence
that this document has been useful but it has certainly not eliminated all
problems in the area of utilization and interpretation of SIRS data.

Report #4 is the first of what we expect to be a series of more specific
research oriented presentations. This particular study investigated the
effect of administering SIRS forms under two different conditions of
student identification: one, the regular condition of anonymity, and a second
mode in which the student records his student number on the SIRS form.
The latter method of administering the SIRS form would make it easier to
design studies to investigate relationships between student characteristics
and responses to the SIRS form. While the results of the study are
somewhat limited, there is considerable evidence that the change from student
anonymity does change the SIRS responses. This suggests that it will
probably be necessary to collect student characteristic data in the same
anonymous fashion and at the same time as the SIRS administration if these
interrelationships are to be meaningfully studied.

Another SIRS study presently underway investigates the type of response 
scale. The SIRS form uses Strongly Agree to Strongly Disagree. Students 
tend to use only the Strongly Agree or Agree response categories for many 
SIRS items. While it is gratifying to know that MSU students have such 
positive attitudes toward their instructors, it is difficult to make 
statistically meaningful discriminations between instructors. One of our 
graduate students is conducting a doctoral study of alternative response 
scales to see if student responses can be made less lenient and, therefore, 
more discriminating.

But what of the future of SIRS at MSU? Certainly all is not sweetness
and light so SIRS will continue to receive more than its share of scrutiny
due to the delicate area with which it is concerned. The use of the same
instructional rating form for both administration decision-making and as
a feedback mechanism to the instructor for purposes of improvement will
continue to be questioned. We are inclined to think that it would be
better to have two types of instruments to meet these quite disparate
purposes. It might be better to use one form that concentrates on
widely-accepted instructional practices such as meeting the class regularly,
clearly defining the objectives of the course, communicating [...]
of student evaluation, and so on. This "responsibilities" form could
be systematically administered by the departments and the results used
in making faculty personnel decisions. In addition, instructors could
use the present SIRS form or an extended version specifically tailored
to specific instructional settings to provide them with diagnostic data
that would be more useful for instructional improvement. Results from
this latter type of feedback could be submitted through departmental
channels if the instructor so desired but would not be required.

Some changes probably need to be made in the SIRS norm system. It 
might be better to report percentile bands rather than specific percentiles 
since the present system suggests a higher degree of precision than we 
would prefer. The question of current norms vs. cumulative norms also
needs further attention. Cumulative norms, which the system presently
uses, maximize sample size, which is important in courses and departments
with small enrollments. However, attitudes of students do change markedly
over time, which reduces the value of data obtained some time earlier.
Very likely some combination of current norms, say from one term earlier,
for large enrollment areas will be introduced.
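
One way the percentile-band idea could look in practice is sketched
below. This is purely illustrative: no band width has been decided, so
the 10-point width, the function name, and the label format are all
assumptions made for the example.

```python
def percentile_band(rank: float, width: int = 10) -> str:
    """Report a percentile rank as a band (e.g. '70-80') instead of a
    single value, to avoid implying more precision than the norms have."""
    if not 0 <= rank <= 100:
        raise ValueError("percentile rank must be between 0 and 100")
    if width <= 0 or 100 % width != 0:
        raise ValueError("band width must divide 100 evenly")
    # Ranks of exactly 100 fall into the top band rather than a new one.
    lower = min(int(rank // width) * width, 100 - width)
    return f"{lower}-{lower + width}"
```

With this sketch an instructor whose rank is 73.4 would simply be
reported in the 70-80 band.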

A subcommittee of our University Educational Policies Committee, 
the committee that approved the original recommendation to our Academic 
Council, has been assigned the task of reviewing the present status of 
SIRS and making recommendations for its improvement. Most of the points
discussed in this paper have been brought to that group's attention.
While it is unwise to predict the outcome of a committee's deliberations,
I expect that, while some changes will undoubtedly be recommended, the
present Student Instructional Rating System will continue to be a viable
aspect of the instructional program of Michigan State University.

MICHIGAN STATE UNIVERSITY

STUDENT INSTRUCTIONAL RATING SYSTEM FORM

SA - if you strongly agree with the statement
A - if you agree with the statement
N - if you neither agree nor disagree
D - if you disagree with the statement
SD - if you strongly disagree with the statement

1. The instructor was enthusiastic when presenting course material.
2. The instructor seemed to be interested in teaching.
3. The instructor's use of examples or personal experiences helped to get points across in class.
4. The instructor seemed to be concerned with whether the students learned the material.
5. You were interested in learning the course material.
6. You were generally attentive in class.
7. You felt that this course challenged you intellectually.
8. You have become more competent in this area due to this course.
9. The instructor encouraged students to express opinions.
10. The instructor appeared receptive to new ideas and others' viewpoints.
11. The student had an opportunity to ask questions.
12. The instructor generally stimulated class discussion.
13. The instructor attempted to cover too much material.
14. The instructor generally presented the material too rapidly.
15. The homework assignments were too time consuming relative to their contribution to your understanding of the course material.
16. You generally found the coverage of topics in the assigned readings too difficult.
17. The instructor appeared to relate the course concepts in a systematic manner.
18. The course was well organized.
19. The instructor's class presentations made for easy note taking.
20. The direction of the course was adequately outlined.
21. You generally enjoyed going to class.

Instructor may insert three (3) items in these spaces.

STUDENT BACKGROUND. Select the most appropriate alternative.

25. Was this course required in your degree program?
26. Was this course recommended to you by another student?
28. How many other courses have you had in this department? (a) none (b) 1-2 (c) 3-4 (d) 5-6 (e) 7 or more

DO NOT WRITE BELOW THIS LINE UNLESS THIS COURSE HAS LABORATORY OR RECITATION SECTIONS

LABORATORY OR RECITATION (fill in your recitation or lab number at the bottom)

31. The laboratory or recitation instructor clarified lecture material.
32. The laboratory or recitation instructor adequately prepared you for the material covered in his section.
33. You generally found the laboratories or recitations interesting.

Instructor may insert two (2) items in this space.

IMPORTANT

WRITE and MARK in the boxes to the right your recitation or laboratory section number.

Number 1 would be [...]; section number 15 would be written [...]

CRITERIA FOR THE EVALUATION OF COLLEGE TEACHING:
THEIR RELIABILITY AND VALIDITY
AT THE UNIVERSITY OF TOLEDO

Richard R. Perry and Robert R. Baumann

University of Toledo



This paper contains two complementary sections. The first section,
written by R. Perry, describes the search for teaching characteristics
which are critical in the discussion of teaching effectiveness.
R. Baumann provides information about the Student Perception of Teaching
Effectiveness Scale which was built primarily upon the findings of
R. Perry's study, and has been used for six years at the University of
Toledo, College of Education.

Introduction 

Faculty of colleges and universities have always been under the
searching eye of those who evaluate performance. This evaluation is
prompted, hopefully, by the widespread interest of society in the educational
process. Widespread interest and consequent evaluation have sometimes had
serious effects on those who are being evaluated. We are all aware that
Socrates was executed in Athens in 399 B.C. as a result of the evaluation
of his teaching, which ended with the accusation that he should be done
away with because of "introducing new gods and corrupting the youth." We
are aware that in the early medieval universities physical abuse and death
were sometimes the consequence of the evaluation of teaching. Cecco de
Sacoli was burned at the stake at the University of Padua in 1237 for
ineffectiveness in his teaching of astrology. George Whitfield, member of
the faculty at Harvard in 1745, was severely censured for being impious
and enthusiastic and possessing a conceit about his own worth and excellence.
This all resulted because he had published a paper in which he accused the
universities of having now "become darkness--darkness that may be felt
where previously teaching. Some of those evaluations result in accolades;
others have different effects.
The Problem

It seems that one of the major difficulties associated with evaluation
is that if evaluation is going to take place, someone, somehow, must
identify the criteria on which the evaluation could be based. That has
been a major problem in evaluation of college teaching.

Identification of teacher effectiveness is so complex that apparently
no one knows today what "the competent teacher" is. The anonymity of the
"competent teacher" has been the spur for countless research studies.
Gage (1960) stated that literature on teacher competence is overwhelming;
so much so that even bibliographies on the subject are unmanageable.
Although numerous studies are reported in the literature, few if any
"facts" are firmly established about teacher effectiveness. There is no
approved method of measuring competence which has received wide acceptance
(Biddle, 1965). The statements by Gage and Biddle support the need to
focus attention on the identification of criteria. The harm that can be
done by using inappropriate criteria suggests research to identify
characteristics of effective teaching behavior.

One of the most serious aspects of the problem of identifying effective
teaching behavior is that without such explicit identification evaluations
which take place are suspect. Significant faults which are assigned to
present methods of evaluation focus chiefly on the following inadequacies:

1. Criteria included in evaluations have not been warranted by
adequate research.

2. Persons who do evaluation are criticized for their lack of
expertness in the very field in which they are operating.

3. Evaluation of teaching behavior has not proven to produce
high reliability in longitudinal studies, when total
effectiveness of teaching behavior is considered.

The lack of conclusiveness of previous investigations has not diminished
the zeal with which the results of such investigations are put forward.
Perhaps the most useful result of all such examinations and experiments
is to identify more clearly the problems experienced in trying to arrive
at clear definitions of effective teaching. A most important consideration
in such research is to understand that substantive evaluation can take
place only in terms of explicit objectives. Until objectives are defined
and agreed upon, evaluations tend toward spuriousness. However, a corollary
to the establishment of objectives is the identification of criteria of
teaching behavior which, hopefully, will elicit, or at least assist in,
the attainment of teaching objectives. Even when a careful definition
of desirable outcomes (objectives in teaching) is attained, it does not
solve the criterion problem.

After objectives have been established for an educational program,
it is necessary to identify those criterion behaviors which will have to
produce the objectives; the criterion behaviors of teaching related to
the objectives are then useful in the pursuit of evaluation of the teaching.
Since the major problem in research on teaching behavior is that of criteria
(McKeachie, 1963), it seems that research on the identification of criteria
which can be warranted for the evaluation of effective teaching behavior
might be helpful.

Such attempts in higher education are not new. They have increased
in frequency in the last ten years. Research on the identification of
warranted criteria received much impetus from the work of Ryans, whose
argument for such research indicates that there are good teachers and
good teaching, and that characteristic behavior associated with such
teaching should be able to be identified. Even though they may be
identified, it can be assumed that not every teacher can possess all the
"good" behaviors or characteristics; thus the goal of such research needs
to be the identification of those criteria of teaching behavior which are
critical. The identification of such criteria has often been left to
expert opinion or to administrative standards. The use of such authority
has resulted in criteria proving unfruitful and of temporary value. The
argument has gained weight that the place to look for characteristics of
teaching behavior which result in effective teaching is in the behavior
of teachers. Such reasoning suggests searching out clusters of behaviors
associated with effective teaching.

A word needs to be said about the meaning of effectiveness. A single
piece of research cannot hope to explore all the dimensions implicit in
a concept such as effective teaching behavior. The majority of research
studies in this area have focused on the assumption that in searching for
teaching effectiveness, the research seeks properties of the teacher.
This assumes that effectiveness is an attribute of the teacher. A further 
assumption is that such effectiveness is not seriously deterred by other 
variables. This establishes an hypothesis about the adaptability of a 
teacher to teaching situations (Fattu, 1963). 

To assume that the effective teacher is one who can accomplish
educational objectives with students, aside from other variables, is to
recognize that effectiveness as a term may have several meanings as it
is identified with several different teaching situations. There is no
harm in using the term effectiveness as long as it is recognized that
it is related to a set of particular conditions.

Effectiveness in teaching, in the sense of the study done at the
University of Toledo and replicated at New Mexico State University at
Las Cruces, Northern Illinois University, and Western Kentucky University,
was taken to mean those behaviors identified by faculty, students, and
alumni which when made operational would result in effective teaching.
A Brief Appraisal of Evaluation of Teaching Behavior 

Evaluation of teaching seems to enjoy great attention in the popular
and professional press, but one needs to remember that systems of such
evaluation have been operative in colleges at least since the early 1920's.
Some procedures have resulted in evaluations being given to deans or
department chairmen who, in turn, are privileged to confer with faculty
about the evaluations. Apparently, other systems of evaluation make it
possible for the results of such procedures to be made known to salary and
promotion committees, and others merely have the results made known to the
professor.

It seems that none of these systems of evaluation is without criticism
and a few of these criticisms are helpful in the identification of basic
faults in such evaluations. Major criticisms which are a matter of record
in the minutes of faculty meetings at a private college indicate that:

1. The present procedure cannot be intelligently considered
as evaluation of effective teaching but would be better
named "poll of student opinion."

2. The present system does little to help in determining
which faculty will be kept and which faculty will be
lost, or which faculty will be attracted to the campus.

3. Those involved in the evaluation are not by education,
experience or responsibility qualified to make the
evaluation they are asked to make.

4. The fact that the current evaluation is obligatory upon
the faculty member is a violation of faculty rights (Antioch
College, April 25, 1964).

The above comments represent a core of a faculty's concern about evaluation
procedures.

There are other thoughts which are based on inadequacies in systems 
of evaluation. These seem to center on the following: 

1. An institution will decide to provide for evaluation of 
teaching and will choose evaluation items from rating 
instruments which are already in use at other institutions. 

2. An institution or indeed an entire state system of higher
education will decide to honor outstanding teachers with
cash prizes but will leave the identification of these
outstanding teachers to the judgments of persons in
positions of administrative authority, or to impressionistic
evaluations of individual faculty. The comment of one
professor who found himself involved in a system of higher
education providing for such identification indicated that,
"even if you wanted to try out for an award you wouldn't
know how to change your teaching. This whole reward set-up
is too much like a beauty contest" (Old Oregon, January-February,
1966, p. 13).

3. An institution will make it possible for the evaluation of
teaching to go on in one college or in one department and
not in all of the departments or colleges on a campus.
Thus, some faculty feel imposed upon while others feel
deprived of the opportunity for evaluations.

4. The most serious concerns about the evaluation of teaching
focus on the question, "Evaluation for what purpose?" 
This question has not been satisfactorily answered on a 
majority of campuses. 

5. An additional area of major concern is finding a satisfactory 
answer to the question, "What criteria can be justified in the 
evaluation of a faculty member's effectiveness as a teacher?" 

There is little question but that evaluation of a faculty member's
effectiveness as a teacher takes place. Students, his faculty colleagues,
and the administration, if he should happen to be known to the
administration, all comment in one way or another about the qualities of teaching
exhibited by the faculty member. Implicit in all such evaluations is
the concept that some faculty must be exhibiting behaviors in their
teaching which are considered to be characteristic of effective teaching.
Finding out what those behaviors are and determining a relative importance
for each of the identified behaviors could be a first step in construction
of a model or set of behaviors associated with effective teaching
at any institution of higher education.

The University of Toledo's study on criteria of effective teaching
centered on identifying effective teaching behaviors and determining their
relative importance.

There are numerous studies which produce interesting statistical 
results concerning reliability, correlations, and the results of factor 
analysis. Difficulties in some of these arise because of methods used 
in selecting criteria for evaluation instruments. Procedures which have
established evaluation instruments by choosing criteria already in use
at other institutions without testing the warrantability of these criteria
for the institution where they are to be used leave something to be desired.
Statistical analysis can be accomplished with responses given to any 
criteria utilized in any rating instrument, but the question remains as 
to the warrantability of criteria which are put to use in such procedures. 

The University of Toledo Study

Background 

Interest in effective teaching is not new to the University of Toledo,
but in the last two years it has received increasing attention from the
University faculty and student body. The administration of the University
in the Spring of 1964 announced the establishment of four outstanding
teaching awards in the amount of $1,000 each. These, financed by the
Alumni Foundation, are given to four faculty each year in recognition of
outstanding accomplishments in teaching at the University of Toledo.
The College of Education at the same time introduced structured evaluation
procedures for its own faculty. The College of Education provided that
at the end of each term faculty members could voluntarily request students
to respond to an evaluation instrument which focused on the qualities of
teaching in those courses taught by the individual professor. The
evaluation instrument not only operated for the individual instructor but
for the course as well. The criteria in the instrument resulted from the
studied deliberations of a faculty committee of the College of Education.
Since 1968, results of the College of Education evaluation procedure have been
made known to the individual faculty member and to the salary and promotion
committee of the College.

The Office of Institutional Research at the University, simultaneously
with these developments, evidenced an interest in conducting a research
study within the University community to get at the identification of those
criterion behaviors which could be warranted for use in the evaluation of
effective teaching behavior in higher education.

The study was proposed to the deans of the colleges and the Faculty
Conference Committee, all of whom endorsed it. An advisory committee to
the Office of Institutional Research was appointed. The advisory committee
consisted of a representative of each college appointed by the dean of
that college. The proposed research focused on the central problem of
evaluating effective teaching in higher education. That problem without
question is the identification of criteria warranted for use in such
evaluations, for unless criteria used in such evaluations can be
demonstrated as warranted for the purpose at hand, they would be irrelevant.

In structuring the study, the Office of Institutional Research made 
the following assumptions: 

1. Criteria for the evaluation of effective teaching are 
related directly to the academic community in which they 
are to be used, and the place to look for these criteria, 
which are most appropriate for one institution, is within 
the academic community represented by that institution.

2. Criteria for the evaluation of effective teaching in 
higher education should be established as the result of 
consultation with those most directly concerned with 
such teaching; namely, students, faculty, and alumni of 
all the colleges of that institution. 

3. Students, faculty, and alumni should have opportunity to 
express their thoughts freely as to what separate actions 
they believe contribute to effective teaching, without 
their responses being limited by procedures which force 
them to select behaviors from a suggested list of such 
criteria which do not originate within their own community.

The First Phase 

The University of Toledo began in the Spring of 1965 and proceeded
during the academic year 1965-1966 with the first phase of the study, with
the second phase being completed in the academic year 1966-67. The
first phase contacted a stratified sample of faculty, students, and alumni
to obtain free response identifications of behavior which contributed, in
the judgment of the respondents, to the effectiveness of teaching. In
order that this could be done and the data handled effectively, response
instruments were designed to the configuration of a data card. The
response instrument, along with a personal data card, was mailed to a
random sample of the student body stratified by college and class rank,
to every member of the faculty of the University of Toledo, and to a
random sample of alumni stratified by college from which they had received
their degrees. Each potential respondent of the sample was given a personal
data card and fifteen response instruments.

Thirteen thousand six hundred and forty-three (13,643) individual
responses were received identifying "effective teaching behaviors."
These responses were received from 812 students, 166 faculty, and 665 alumni.
This resulted in replies from 10% of the student body, 30% of the faculty,
and 8% of the alumni degree holders. The mean number of behaviors identified by
students was 8.7; the mean by faculty, 8.2; the mean from alumni, 6.8.

These 13,643 identified behaviors were then "read" by a jury group
to identify duplications in behaviors. The jury group was looking for
criterion statements which said the same thing essentially, although the
wording of the criterion behavior statement might have been different.
Examples are the two following responses: 

1. "Ability to keep presentation of subject matter at a 
level comprehended by the student." 

2. "Ability to present subject matter at student level." 
Though the wording is slightly different in each statement, each can be 
valued as meaning the same as the other. The result of this reading 
process was to categorize 13,643 individual behaviors into 60 criterion 
behaviors. The reading procedure had one jury person read the statements 
placing them in categories of sameness and then to have these categories 
checked by second and third jury persons; thus, questions were raised as 
to the appropriateness of the classification of any one of the criterion 
statements. 

An additional result of this reading process was to identify six
major categories of effective teaching behaviors. These six categories
contained individual behaviors which grouped themselves into major
behavior categories representing concentrations of similar kinds of behavior
so as to permit their identification as major separate areas of teaching
behavior. The identification of the individual criterion behaviors and
the clustering of these into the six major criterion behavior areas ended
the first phase of the study.
The Second Phase

With the criterion statements on hand, the task was to obtain
judgments of how warranted these were for the evaluation of effective
teaching behavior. This was accomplished by designing a response
instrument in which the criterion behaviors were listed. The order of their
listing was provided by a random listing of numbers supplied by a random
number program from the University computer. The instruments provided
for a response to the importance of each criterion from critical importance
through no importance. Each respondent was able to categorize himself
by checking appropriate spaces.

A sample of students stratified by college and class rank and a
similar sample of alumni by college in which they had earned degrees were
presented with the instrument, along with all faculty. Usable responses
were returned by 756 students, 850 alumni, and 187 faculty. Returns
resulted in replies from 7.5% of the students, 8.6% of the degree-holding
alumni, and 35% of the faculty. These percentages of the academic
community seemed adequate in view of present research practices (Holland
and Richards, 1965). Weights of 5, 4, 3, 2, and 1, respectively, were
assigned to the response areas of critical, above average, average, below
average, and no importance. These data were coded into punched cards and
processed for statistical analysis to establish rank order correlations
for selected categories of responses. Of the 82 rank order correlations
calculated, 40 were in the .90's, 34 in the .80's, and 8 in the .70's,
all well beyond the .01 level of significance.
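
The weighting and rank order comparison described in this phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not name the rank order statistic, so Spearman's coefficient is assumed here, and the function names and the weighted totals for the two groups are hypothetical, not data from the study.

```python
from collections import Counter

# Weights stated in the text for the five response areas.
WEIGHTS = {"critical": 5, "above average": 4, "average": 3,
           "below average": 2, "no importance": 1}

def weighted_total(responses):
    """Sum the weights of one group's responses to a single criterion."""
    counts = Counter(responses)
    return sum(WEIGHTS[label] * n for label, n in counts.items())

def ranks(values):
    """Rank values from highest (rank 1) to lowest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank order correlation (assumed statistic): Pearson r of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical weighted totals for five criteria from two respondent groups.
students = [410, 388, 350, 295, 240]
faculty = [402, 360, 371, 280, 251]
rho = spearman(students, faculty)
```

In the study itself such a coefficient would be computed over all sixty criteria for each pair of selected respondent categories.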
The Third Phase

The University of Toledo identified four outstanding teachers in
each of the years 1964, 1965, and 1966. Responses of this group were
obtained and processed for the same statistical analysis as for other
selected respondent categories. The rank order correlations between the
"Outstanding Teacher" group and other groups were all greater than .70,
well beyond the .01 level of significance. The correlations of the ranking
of the criteria by the outstanding teachers with those of all other groups
in the study have the effect of testing the order of importance established
in the study against the judgments of a "jury of experts." Seemingly, this
is further justification for the warrantability of the criteria in the
order established for them by the responses of the total group.
A Possible Weighting Procedure

A criticism often leveled at evaluation procedures is that each
criterion is assumed to be of the same value. The warranting of criteria
in this study provides for a value factor to account for the demonstrated
differences in importance of each criterion. This value factor for each
criterion was established by assigning the weighted raw score totals of
all groups for each criterion to that criterion. For ease in computation
and handling, weighted scores have been identified as decimal value factors.
Such value factors permit an evaluation instrument to be constructed
including all or selected criteria from the study. An Effectiveness Evaluation
could use criteria from the research in the following fashion:

Sample item:

Check the term which in your judgment best describes your professor's
characteristic teaching behavior.

This professor demonstrates comprehensive knowledge of his subject.

Always    Most of the time    Occasionally    Very Seldom    Never

A student marking the space "always" would be giving the faculty
member a "5" on that item, which when multiplied by its value factor
of .732 would give him a score of 3.66 on this one item.

The sum of the products of the criterion ratings and the criterion value
factors would produce an effectiveness score.
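
As a minimal sketch of this arithmetic, the following Python computes an item score and an effectiveness score. Only the value factor of .732 and the rating of 5 for "Always" come from the text; the numeric values for the other response options, the second criterion, its value factor of .650, and all function names are assumptions made for illustration.

```python
# Numeric value of each response option. "Always" = 5 follows the sample
# item; the values for the remaining options are assumed.
RESPONSE_VALUES = {"Always": 5, "Most of the time": 4, "Occasionally": 3,
                   "Very Seldom": 2, "Never": 1}

def item_score(response, value_factor):
    """Rating on one criterion multiplied by that criterion's value factor."""
    return RESPONSE_VALUES[response] * value_factor

def effectiveness_score(responses, value_factors):
    """Sum of rating-times-value-factor products over all criteria rated."""
    return sum(item_score(responses[c], value_factors[c]) for c in responses)

# The first factor is the one quoted in the text; the second is hypothetical.
value_factors = {"comprehensive knowledge": 0.732,
                 "hypothetical second criterion": 0.650}
responses = {"comprehensive knowledge": "Always",
             "hypothetical second criterion": "Occasionally"}

single = item_score("Always", 0.732)                     # 5 x .732 = 3.66
total = effectiveness_score(responses, value_factors)    # 3.66 + 1.95
```

An instrument built from a subset of the criteria would simply pass a shorter dictionary of value factors.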

Findings 

1. All rank order correlations between selected groups of respondents 
are different from 0 at the .01 level of significance for individual 
criteria. 

2. Sixty criterion behaviors associated with effective teaching at
the University of Toledo have been established as warranted for evaluation
of such teaching.

3. The academic community of the University of Toledo is agreed on 
the importance of the sixty criteria in the rank order which is established 
in the study. 

4. A table of weights of importance has been established to account
for the importance of each criterion.

5. Rank order correlations are different from 0 at the .05 level of
significance for the major behavior categories between 72 of the 78 selected
groups.

Observations 

Research on the effectiveness of teaching indicates promise in
clarifying issues which surround this presently popular topic related to
the evaluation of teaching. Such research can also help prevent the
perpetuation of error in such evaluations or at least provide an analysis
of a major problem in any evaluation. That problem is the identification
of criteria to be used. This study seems to have done this for the present 
at the University of Toledo. An additional useful result of this study 
is the providing of a value weight for each criterion which could be used 
in an evaluation instrument in order that some accounting of the differences 
in importance of criteria used in such evaluations may be accomplished. 

The study reported here is apparently unique in that it provides a
sample of one institution's total academic community an opportunity to
participate in consideration of criteria which may be used in evaluation
of effective teaching. It is apparently the only study in which the
judgments of a representative sample of a complex academic community on
such criteria have been tested against a jury of outstanding teachers in
an institution.

Of course, significant problems remain in the evaluation of effective 
teaching. They are: 

1. The competence of persons doing the evaluation, and

2. The test of reliability of the criteria and procedures, which can
only be accomplished through longitudinal studies.

It seems, though, that a sound beginning has been established with the
identification of criteria in this study.
Usefulness of the Findings of this Study 

The University of Toledo was not completely satisfied with the fact
that it had established, on statistical grounds, criteria useful in the
evaluation of teaching, and consequently we sought the assistance of three
other universities which had indicated an interest in having the University
of Toledo study replicated on their campuses. The criteria which had been
established in the Toledo study were then placed in response instrument
form and distributed to sample populations at New Mexico State
University at Las Cruces, Western Kentucky University, and Northern Illinois
University. We did this because, although the University had completed
research which substantially identified criteria of effective teaching
appropriate for the University of Toledo, the question remained as to how
these criteria would fare under the evaluation of their warrantability in
a wider form of judgment.

Invitations to participate in the research were sent to the Offices 
of Institutional Research at New Mexico State University, Northern Illinois 
University, and Western Kentucky University. These institutions agreed 
to participate in the research and accepted the offer of the University 
of Toledo to furnish the materials necessary for the research and the
services required to process the data and interpret it. The same response 
instruments used at the University of Toledo in identifying the importance 
of each criterion behavior were prepared in quantities requested by New 
Mexico State, Northern Illinois, and Western Kentucky. These were given 
to the randomly selected sample populations at each institution in the 
Spring of 1968 with data being sent to Toledo for processing during the 
late Spring and over the Summer of 1968. The derived ranks for each 
criterion behavior by each university are shown in Table 1.

Insert Table 1 here

The four universities are in agreement that: 

1. Each criterion behavior identified in the response instrument is
warranted for the evaluation of effective teaching. 

2. The criteria are important in the evaluation of such teaching in 
the rank order established by the study. 

3. There is no significant disagreement among the reporting categories
selected for study about the rank order importance of these criteria.

The research effort over the past two and one-half years identified
with this study has been fruitful particularly for the following reasons:

1. Apparently for the first time, large numbers of the significant 
segments of four universities have identified criteria warranted for the 
evaluation of effective teaching in their universities. 

2. For the first time, four public universities have cooperated
to test the findings of their individual research on effective teaching
against the judgments of other academic communities.

3. The increasing acceptance of the results of this research by 
students and faculty is an indication that the procedures and findings 
are proving useful. 

Those who have worked with the study for two and one-half years consider
all of the above useful, satisfying, and one more small step toward the
establishment of some better ground on which to evaluate teaching, but by
no means the end of such research. One cannot hope to establish a
universal system for such evaluation. The possibilities provided in the
procedures here indicate that, since there is such high correlation in
the judgments of these public universities, it can be hypothesized
that similar results would be found in the responses from a larger number
of public universities. If such were to be the case, we might be on
the path to the identification of a typology of students and faculty who
attend and teach at such institutions in terms of their attitudes toward
effective teaching. Similar research conducted in the sector of private
higher education or sectarian higher education might produce interesting
and useful results.

The College of Education at the University of Toledo considered
the results of the study soon after its completion in 1967 and
consequently changed its procedures of evaluation by incorporating the top
15 or 20 of the criteria in a newly designed evaluation instrument. The
administration of that instrument and the research which has followed are
described in a companion paper attached hereto.

STUDENT PERCEPTIONS OF TEACHING EFFECTIVENESS
Introduction 

On the basis of the judgments of teaching effectiveness by several
relevant populations (students, faculty, and alumni), as described in a
companion paper, the College of Education, University of Toledo, prepared
a 15-item rating scale. Fourteen of the items (later revised to nineteen)
were chosen from those characteristics most often judged as critical in
describing teaching effectiveness. The fifteenth item (later the twentieth)
asked the students to provide a global rating of teaching by the instructor
of the courses in which they were enrolled. It was expected that those
items preceding the last item would provide a multi-dimensional frame of
reference within which a mediated judgment of teaching could be obtained.
(See Appendix for latest form used.)

The original intent of the scale was to provide a formal feedback
routine for the instructors about their instructional methods. Both
a summary of the ratings received from the students and their unstructured
comments were given exclusively to each instructor. In the Fall of 1968,
the College faculty voted to provide the information from the ratings
to the elected College salary and promotions committee. This decision
brought about several problems. One of the major problems was that of
preparing effective guidelines for the interpretation of numbers whose
truth value did not extend to the fourth decimal place. Another one was
that of reconciling the tendency of students to give poorer ratings if
they were enrolled in large classes with other freshmen and sophomores,
and to give higher ratings if they were enrolled in small classes with
other graduate students.

In an attempt to diminish the bias introduced by size of class and
instructional level, thirteen norm groups were established. The rating
of each course then was made relative to the ratings of other courses of
the same size and class level. That is, the rating of an instructor on
"overall teaching" was transformed to a standard score using the appropriate
mean and standard deviation. The average of these standard scores for
each course for which the instructor was responsible became the index
of "effectiveness" as perceived by the students.
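
The norming step can be illustrated with a short Python sketch. The norm group labels, their means and standard deviations, and the course ratings below are all hypothetical; only the procedure (standard score within the course's norm group, then the average across the instructor's courses) follows the description above.

```python
from statistics import mean

def standard_score(rating, norm_mean, norm_sd):
    """Express a course rating relative to its own norm group."""
    return (rating - norm_mean) / norm_sd

def effectiveness_index(courses, norms):
    """Average the standard scores of all courses taught by one instructor.

    courses: list of (norm_group, rating) pairs
    norms:   dict mapping norm_group to (mean, standard deviation)
    """
    return mean(standard_score(rating, *norms[group])
                for group, rating in courses)

# Hypothetical norms for two of the thirteen norm groups.
norms = {"small graduate": (1.8, 0.4),
         "large freshman-sophomore": (2.6, 0.5)}

# Hypothetical "overall teaching" ratings for one instructor's two courses.
courses = [("small graduate", 1.6), ("large freshman-sophomore", 2.6)]
index = effectiveness_index(courses, norms)
```

Whether a positive or a negative index is favorable depends on the direction of the rating scale, which this sketch does not assume.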

The problem related to interpretation was answered by categorizing 
faculty indices into one of three classifications: upper one-fifth, middle 
three-fifths, and lower one-fifth. Such information was provided to the 
salary and promotions committee.
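
One way the three-way classification might be carried out is sketched below in Python. The paper does not say how ties or group sizes not divisible by five were handled, so rounding down is assumed, as is the convention that a higher index ranks higher; the instructor names and indices are hypothetical.

```python
def classify(indices):
    """Label each instructor by position in the ranked list of indices.

    indices: dict mapping instructor name to effectiveness index.
    Assumes a higher index ranks higher and that each outer fifth
    contains len(indices) // 5 instructors (rounded down).
    """
    ordered = sorted(indices, key=indices.get, reverse=True)
    n = len(ordered)
    cut = n // 5
    labels = {}
    for position, name in enumerate(ordered):
        if position < cut:
            labels[name] = "upper one-fifth"
        elif position >= n - cut:
            labels[name] = "lower one-fifth"
        else:
            labels[name] = "middle three-fifths"
    return labels

# Ten hypothetical instructors with evenly spaced indices.
indices = {f"instructor_{i}": 1.0 - 0.2 * i for i in range(10)}
labels = classify(indices)
```

With ten instructors this yields two in each outer group and six in the middle.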
Construct Validity 

As is undoubtedly well known, the study of the validity of a scale
alleged to be measuring a construct is characterized by the relationships
of the scores derived from the scale with other variables: variables with
which the relationship is expected to be strong as well as variables with
which the relationship is expected to be minimal or null. Several studies
have been made with the Student Perception of Teaching Effectiveness Scale
focusing on the latter set of variables--those with which the relationship
is expected to be minimal. Essentially, the studies were those of bias.
If the scale is valid, the relationship of the scores with grades, with
class size, with instructional level, with sex, with G.P.A., and the like
207 



201 



ought to be miniiiiuX or zero. 'Hie tables on the next few pages display 
the infoimtion collected. 

Tables 2, 3, and 4 reveal information which suggests that factors
besides "teaching effectiveness" are related to the scores derived from
the Student Perception of Teaching Effectiveness Scale. Table 2 clearly
indicates the bias extant in class size and instructional level. In a
nearly perfect order, the rating increases in numerical size from smallest
to largest size classes. Similarly, though not as perceptibly, the
general ratings by level of instruction increase in numerical size from
graduate students to freshman-sophomore levels. The relationship between
the interaction of these two variables and the scores derived from the
scale has been measured as 0.11 (the correlation ratio--eta squared).
Statistically one can remove the bias by "partialling" it out--by setting
up separate norms.

Insert Table 2 here

Table 3 also reveals certain tendencies which would suggest a
relationship between the variables and the scores from the scale. While
the variable of sex of student and the required-elective variable seem
to have but slight relationship, the "reported" GPA (reported = student
reported) and expected grade indicate clearly discernible differences of
mean ratings over the several levels of each. While the first two
variables, sex and required-elective, can be diminished through norming
procedures, the variables of GPA and grade are not so easily dismissed.
The former is amenable to distortion by student manipulation and ignorance--
consider the responses of 85 graduate students with respect to GPA who
reportedly have received a pattern of grades which would clearly restrict
them from attending classes. The latter variable is amenable to distortion
by act of the faculty member who assigns the grades. And grades seem to
have considerable relationship with ratings. Table 4 indicates that
the average Pearson "r" within each norm grouping is r = -0.42. (Note
that a negative correlation would indicate that the higher the grade
received, the higher the rating given--negative because of the inverse
nature of the meaning of the scale orders for the two variables.) A bit
of explanation is in order about the technique employed to obtain the
correlations shown. The elements in the calculation are class characteristics,
not individual student characteristics. Each class or course received an
average student rating; each class also was categorized by the average
grade given by the instructor to the students enrolled therein. Then,
within each norm group, and later within each instructional level, the
Pearson "r" was obtained.

Insert Table 3 here

Insert Table 4 here

The size of the correlations is quite striking. To be sure, what is
offered is a record of but one administration of the scale--Spring, 1972.
Yet, correlations of -0.78, -0.77, -0.60, -0.59, and -0.55 are so large
that it would be quite unexpected for them to vanish in another
administration of the scale. The indictment of the validity is very strong; what
the correlations reveal is that the variation in course ratings is
accounted for to the extent of from 30 to 60% by the grades assigned.
One could argue that those who give higher grades are those who are more
effective; yet, it would be difficult to convince those who reportedly
have the same students in their courses and have a different line on grades.
Whatever the explanation, this problem must be resolved in some fashion
before one can build a reasonable case for validity.
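
The "30 to 60%" figure can be checked by squaring the five correlations
quoted above, assuming the usual reading of a squared Pearson r as the
proportion of variance the two measures share. The variable names are
introduced here only for the illustration.

```python
# The five largest within-norm-group correlations quoted in the text.
correlations = [-0.78, -0.77, -0.60, -0.59, -0.55]

# Squared correlation, expressed as a whole-number percentage of variance.
percent_shared = {r: round(r * r * 100) for r in correlations}

lowest = min(percent_shared.values())
highest = max(percent_shared.values())
```

The smallest of the five correlations gives the lower bound and the
largest gives the upper bound of the range cited.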

Scale reliability. The question of reliability of the outcome
of the scale administration has been given but cursory examination.
The question of reliability is not that of the usual "individual"
assessment but that of the average assessment. It would appear that where
there is considerable consensus on the rating to be given, to that degree
there is some confidence in the reliability of the average obtained.
Where there is a lack of consensus, e.g., a uniform distribution of ratings,
less confidence appears warranted. A study of the extreme fifths and
middle three-fifths of the distribution of standard scores, referred to
earlier as indices of effectiveness, revealed that the order of consensus
is directly related to the order of "effectiveness." The median modal
relative frequency for the upper one-fifth was 86%; for the middle
three-fifths, 58%; for the lower one-fifth, 42%.
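
The consensus measure can be illustrated as follows. The sketch assumes
that "modal relative frequency" means the share of raters in a class who
chose the single most frequent rating category; that reading, the function
name, and all of the ratings below are assumptions made for the example.

```python
from collections import Counter
from statistics import median


def modal_relative_frequency(ratings):
    """Share of raters who chose the most frequently chosen category."""
    counts = Counter(ratings)
    return max(counts.values()) / len(ratings)


# Hypothetical classes rated on the 1 (excellent) to 4 (poor) scale.
high_consensus = [1] * 8 + [2] * 2   # most students agree on "1"
uniform = [1, 2, 3, 4] * 5           # no consensus at all

# Median of the measure over a hypothetical group of three classes.
group_median = median(
    modal_relative_frequency(c)
    for c in (high_consensus, uniform, [2] * 6 + [3] * 4)
)
```

With four rating categories the measure cannot fall below 25%, the value
produced by a perfectly uniform distribution of ratings.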

Other studies. Other studies have had little central focus but to
pursue "interesting" questions. A factor analysis of the scale was
undertaken to note (1) whether we were measuring a unitary trait, and (2)
if not, what factors appeared to be present in the set of items. The
following clusterings of items or topics were determined:

Knowledge and Skill in Explanation      Concern for Students

Meaningful class preparation            Fairness in evaluation
Interest in subject                     Respect for students
Knowledge of subject                    Availability for consultation
Motivation of students                  Promptness in returning assignments
Ability to explain                      Offer of assistance
Responses to questions
Overall teaching

Use of Teaching Tools                   Inspiration

Examinations required understanding     Encouraged independent thought
Fairness in evaluation                  Motivated students
Value of textbook                       Respect for students
Overall teaching


Pressure to apply the means of evaluation often stems from some
dissatisfaction with that which is to be evaluated. That is, evaluation
should in some way improve the quantity or quality of the item or process.
Table 5 portrays the experience of the College with average student ratings
for the fifteen-item (later twenty-item) scale since the Spring of 1968.
(The fifteenth question and the twentieth question from the initial and
revised scales, respectively, are identical--thus the peculiar format
used in the last three columns of Table 5.) It is noteworthy that the
perception of "overall teaching" and other items have tended to improve,
albeit somewhat irregularly. To the degree that students' perceptions
are accurate, the evaluation routine has had a beneficial effect.

Insert Table 5 here

Summary. The College faculty, as a group, has recently confirmed
its opinion that the information obtained through the use of the Student
Perception of Teaching Effectiveness Scale is useful in deliberations of
the Salary and Promotions Committee. That is, such information has greater
validity than the "gossip" which formed the basis previously for such
deliberations. It is likely, however, that those who make the decisions
are not cognizant of the caution necessary in the interpretation of the
information given. We are hopeful that studies that we can generate
together with the information available from others can improve our
confidence in our results and the decisions made. It would be extremely
useful to have access to a "clearing-house" which allowed the concentration
of information and the dissemination required to make progress.

Table 1

Rankings of Criterion Behaviors by Institution

        Rankings
   NMS   NIU   WKU    UT  Criterion Behavior

    26    25    27    26   1. Evidencing better than average speech qualities
     4     5     9     5   2. Constructing tests which search for understanding
                              on the part of the students rather than rote
                              memory ability
    27    28    29    32   3. Providing several test opportunities for students
    24    29    31    28   4. Engaging in continued formal study in his field
    12    12    14    12   5. Acknowledging all questions to the best of his
                              ability
    11     9     5    10   6. Motivating students to do their best
    40    37    42    45   7. Explaining grading standards
    57    59    60    60   8. Publishing material related to his subject field
    20    19    24    21   9. Having practical experience in his field
     7     6     6     7  10. Communicating effectively at level appropriate to
                              the preparedness of students
    28    27    41    27  11. Identifying his comments which are personal opinion
    44    38    52    43  12. Challenging students' convictions
    47    48    45    47  13. Utilizing visual aids to assist in creating
                              subject matter achievement with students
    39    41    36    46  14. Announcing tests and quizzes in advance
  22.5    17    26    25  15. Making written comments on corrected returned
                              assignments
    43    42    47    41  16. Presenting organized supplementary course material
    17    15    15    17  17. Establishing good rapport with students in
                              classroom
    36    30    28    38  18. Making an effort to know students as individuals
    52    52    51    49  19. Inspiring students to continue for graduate study
     6    10    10     3  20. Demonstrating comprehensive knowledge of his
                              subject
    46    44    38    40  21. Exhibiting an intelligent personal philosophy of
                              life
    25    22    23    24  22. Encouraging student participation in class
    48    51    48    51  23. Beginning and ending classes on time
  22.5    21    21    23  24. Accepting justified constructive criticism by
                              qualified persons
    50    49    49    50  25. Sharing departmental duties with his colleagues
    53    54    57    54  26. Having irritating personal mannerisms
     2     2     3     2  27. Establishing sincere interest in subject being
                              taught
    38    43    32    31  28. Taking measures to prevent cheating by students
    21    26    17    18  29. Recognizing his responsibility for the academic
                              success of students
    59    58    54    58  30. Devoting time to student activities on campus
    35    31    22    30  31. Demonstrating a stable level-headed personality
    30    32    34    34  32. Returning graded assignments promptly
    16    18    13    20  33. Patiently assisting students with their problems
    55    55    56    56  34. Holding membership in scholarly organizations
     1     1     1     1  35. Being well prepared for class
    18          25    16  36. Setting high standards of achievement for students
    58    56    55    55  37. Involving himself in appropriate university
                              committees
    54    53    53    53  38. Being knowledgeable about the community in which
                              he lives
    14    14    16    15  39. Being readily available for consultation with
                              students
    41    36    40    36  40. Displaying broad intellectual interests
    10     4     2    11  41. Treating students with respect
    19    20    18    17  42. Raising the aspirational level of students
    13    13    12    13  43. Being able to show practical applications of
                              subject
     8    11    11     9  44. Organizing the course in logical fashion
    60    60    58    59  45. Making appearances which assist programs of
                              community
    45    45    37    42  46. Having the respect of his colleagues
     5     7     8     8  47. Encouraging intelligent independent thought by
                              students
     9     8     7     4  48. Using teaching methods which enable students
                              to achieve objectives
    15    16    19    14  49. Rewriting and updating tests
    49    46    50    48  50. Presenting an extensive lucid syllabus of the
                              course
    37    34    39    41  51. Explaining grading procedures
    56    57    59    57  52. Being consistently involved in research projects
    34    47  43.5    39  53. Seldom using sarcasm with students
    33    35    35    33  54. Indicating that the scope and demands of each
                              assignment have been considered carefully
     3     3     4     6  55. Being fair and reasonable in evaluation procedures
    31    40    46    35  56. Relating course material to that of other courses
    29    24    30    29  57. Using more than one type of evaluation device
    51    50  43.5    52  58. Being neatly dressed
    42    33    33    37  59. Exhibiting a genuine sense of humor
    32    39    20    22  60. Encouraging moral responsibility in students by
                              example

Note - NMS = New Mexico State, Las Cruces; NIU = Northern Illinois University;
WKU = Western Kentucky University; UT = University of Toledo. The number of
responses on which the above information is based: NMS = 654; NIU = 2488; WKU
= 1698; UT = 1793.

Table 2

Mean Ratings and Standard Deviations by
Size of Class and Instructional Level

Instructional                        Size of Class
Level                  1 - 10   11 - 24   25 - 49   50 - 100   Over 100

Graduate          M     1.33      1.55      1.69      1.65
                  SD    0.50      0.41      0.48      0.63

Junior-           M     1.63      1.76      1.76      2.24
Senior            SD    0.48      0.49      0.57      0.41

Freshman-         M     1.48      1.67      1.83      1.85       1.91
Sophomore         SD    0.52      0.40      0.50      0.52       0.35

Note - Ratings are based on a scale of 1 - 4; 1 is labeled excellent, 4, poor.
Means and standard deviations shown have been accumulated to Spring, 1972.

Table 3

Mean Ratings for Student Perceptions of
Teaching Effectiveness Given Certain
Characteristics of Class Members

                        Freshman-          Juniors-
                        Sophomores         Seniors            Graduates
Characteristic          N      Mean        N      Mean        N      Mean

Required Course         718    1.731       549    1.607       437    1.556
Elective Course         196    1.714       143    1.53X       249    1.558

Males                   369    1.751       247    1.664       321    1.517
Females                 562    1.740       451    1.567       376    1.614

Reported G.P.A.
  0.00 - 2.00           139    1.604        44    1.545        34    1.529
  2.01 - 2.50           283    1.756       160    1.581        51    1.745
  2.51 - 3.00           217    1.806       219    1.543        55    1.509
  3.01 - 3.50           197    1.802       199    1.643       200    1.570
  3.51 - 4.00            61    1.836        55    1.764       315    1.549

Expected Grade
  A                     356    1.632       369    1.466       433    1.513
  B                     333    1.775       205    1.693       158    1.677
  C                     149    2.007        65    2.108        19    1.632
  D                      26    2.308        18    1.889        22    1.864
  E                      11    1.858        11    1.364        22    1.636

Note - N is the number of individuals in such classification who made a
rating in the Spring, 1972. The base for the rating is: 1 - Excellent, 2 -
Good, 3 - Fair, 4 - Poor.

Table 4

Correlations of Mean Grades Given in
Course and Average Ratings on Overall
Teaching by Norm Grouping
Spring, 1972

Instructional                          Size of Class
Level               1 - 10     11 - 24    25 - 49    50 - 100   Over 100

Graduate            N = 20     N = 18     N = 14     N = 0      N = 0
                    r = -.03   r = .07    r = -.28

    Graduate, combined r = -.08
    Graduate, size partialled out, r = -0.08

Junior-Senior       N = 10     N = ?      N = ?      N = 0
                    r = -.78   r = -.60   r = -.59

    Junior-Senior, combined r = -0.63
    Junior-Senior, size partialled out, r = -0.65

Freshman-Soph.      N = ?      N = 13     N = 8      N = 0
                    r = -.77   r = -.26   r = -.55

    Freshman-Sophomore, combined r = -0.32
    Freshman-Sophomore, size partialled out, r = -0.40

Total Group

    Combined r = -0.42
    With size and level partialled out, r = -0.42

Note - Scales measuring rating of class and grades are inverse in meaning--"1"
the best score on rating scale, "4" the poorest; "4" is the highest score for G.
... is a lower score.

 1. Meaningful class preparation and planning
 2. Demonstrated sincere interest in the subject
 3. Demonstrated comprehensive knowledge of the subject
 4. Employed exams which required understanding of ideas in course
 5. Demonstrated fairness and reasonableness in evaluating students
 6. Encouraged independent thought by students
 7. Motivated students to do their best
 8. Demonstrated respect for students
 9. Demonstrated ability to explain course material
10. Responded to questions to the best of his ability
11. Availability for consultation
12. Demonstrated promptness in returning graded assignments and exams
13. Offered assistance to students with problems connected with course
14. Value of text and/or instructional materials used
15. Personal interest and sensitivity to student problems
16. Assignments made were in harmony with course objectives
17. Outside work demanded in line with course credit hours
18. Rating of course content as important and valuable
19. Class experiences have promoted learning on your part
20. Overall evaluation of teaching in this course
21. Course: Required Col. 1, Elective Col. 2
22. Sex: Male Col. 1, Female Col. 2

    G.P.A. (See Below)
    Expected Grade (See Below)

    G.P.A.    0.00 - 2.00    1st column
              2.01 - 2.50
              2.51 - 3.00
              3.01 - 3.50
              3.51 - 4.00

Code Number of Instructor.

What factors influence or are correlated with student ratings of
teachers? Most of the work on student ratings has been strictly empirical--
beginning with general notions about what a good teacher ought to do,
writing items about these characteristics, factor analyzing them, and
then attempting to validate them. But to understand what these ratings
mean we need to fit them into larger theoretical structures. One way of
doing this is to relate them to other variables that we know something
about.

Basically we assume that student ratings are descriptive of teacher
behavior and of the teacher's effect upon the student who fills out the
rating scale. Insofar as the items of the scale are descriptive of teacher
behavior we expect high inter-rater agreement, but we expect greater valid
(and invalid) variability when we ask for the students' assessment of
teaching effectiveness or value of the course in their own education.
Some of us are following Dr. Hoyt in trying to get a clearer picture of
what goes into such an overall rating by asking about the effect of the
course on the student's judgment of his achievement of several different
kinds of goals. From all that is known about social perception and attitudes,
it seems very unlikely that judgments of teaching effectiveness are unaffected
by student characteristics. Thus it is important to know what student
characteristics affect ratings and the degree to which a given set of ratings
is the result of autochthonous factors rather than of the more objective
qualities the rating was intended to assess.

Another set of characteristics likely to influence student ratings
of teaching are characteristics of the class or course. Is it easier to
get good ratings in small classes than in large? Does the teacher of a
required course have a tougher job than the teacher of an elective course?
In interpreting a teacher's ratings we usually are influenced by some
assumptions about such variables. Thus it is important to know how valid
these assumptions are.

A third set of factors influencing ratings are characteristics of
the instructor himself. Does an experienced teacher get better ratings
than an inexperienced one? What personality characteristics of the teacher
influence what he does in teaching and how the students react to him?

In this paper I do not intend to review interrelationships between 
items on scales for student ratings of teaching nor will I enter the realm 
of correlations between student ratings and student learning or other 
criteria of validity. Each of these topics would constitute a paper in 
itself. 

Student Characteristics 

The classic research on most aspects of student ratings of instruction
was carried out by Herman Remmers and his students at Purdue. His results
are still largely unchallenged by more recent research. Among the factors
which did not significantly affect student ratings were such student
characteristics as:

Veteran/non-veteran status

Age

Sex

Class standing

Grade in course (However, when the top students achieve more than
                 expected they rate the course higher, and when the
                 poorer students do better than expected they rate
                 the course higher.)



Student expectations    Kelley and Perry, Niemi, & Jones (1973) have
                        shown that student expectations affect ratings
                        for a single lecture, but we have little evidence
                        on the dynamics affecting persistence of expectancies
                        over a term.

Personality             Costin & Grush (1973, unpublished) found no relation
                        between traits measured by the Gordon Personal
                        Profile and ratings. The organizers of this
                        conference hoped to stimulate research, so we did
                        some. Our results, like those of Costin & Grush,
                        were largely negative. Using Gough's California
                        Psychological Inventory (the CPI) we obtained only
                        three significant interactions out of fifty tested.



Content

Carney and I found some interaction between content and sex affecting
ratings in a psychology course. Women liked life-oriented topics;
men liked science-oriented topics. Turner et al. found that high
anxiety students prefer personality-social content.

Course Characteristics

Class size              Generally smaller classes are preferred, but
                        results are not uniform. Often the best teachers
                        are assigned larger classes and are rated well.
                        Perlman (1973) found that students at Manitoba
                        rated smaller classes higher on two major dimensions--
                        intellectual stimulation and socio-emotional climate.

Required vs. elective   Remmers found no difference, but Lovell
                        & Haner (1955) and Kapel here at Temple
                        found that required courses were rated
                        lower.

The relatively small effect of variables such as size or required
vs. elective leads me to feel that students take account of the
teacher's task in their ratings. They may give higher ratings if
they think a course is hard to teach. Moreover they may give higher
ratings if they can assess their learning in conventional ways.
Hence, there may be a bias toward lecture-test courses which is not
reflected in real long-term effects. Shillace, for example, reports
95% retention of anecdotes; 25% retention of the point of the anecdote
in lecture.

Students can judge whether they followed a lecture and can count
pages of notes.

Students are less likely to be able to evaluate gains in ability to
analyze or evaluate. The fact that difficulty of a course has no
effect on ratings is not as surprising as it may seem. There are
many ways of making a course difficult, most of which have little to
do with increased learning. Moreover, students despise Mickey Mouse
courses. As Walster has shown in laboratory experiments, "hard-to-get"
goals are rated higher. Students may neglect to include in their
rating skillful planning of method, content, textbook, teaching
technology. But they do give credit for trying, for concern.

Instructor Characteristics 

Sex - No difference (cf. Centra)

Age - Younger teachers are rated higher (Riley, 1959)

Rank - Results are mixed

Degree - BA instructors are rated lower than MA's or PhD's
         (Riley, 1949)

Experience - Mixed results, but mostly some improvement with
             experience (Costin)
             No effect (Centra)

Grading standards - Mixed results on this in terms of overall
                    ratings, but lower graders are rated lower on fairness
                    of grading (Heilman & Armentrout, 1966)

Knowledge of subject - No effect

Knowledge of teaching - No effect

Research - Publishers not higher (Aleamoni). Second authors are
           rated higher. First authors of books were rated poorly
           (Feldhusen)

Personality of Instructor 

Getzels & Jackson (1963) reviewed 150 studies (public schools) and
concluded that little is known about instructor personality.
The same is true of instructor personality and ratings at college
level. Bendig (1955) and Sorey (1968) found no relationship between
Guilford-Zimmerman scores and effectiveness.

In our studies at Michigan peer ratings of the general culture of a
teacher correlated positively with student learning and ratings.
Enthusiasm-Surgency on the 16PF was also positive.

Costin & Grush (1973) using the Gordon Personal Profile found that
vigor and student-perceived original thinking, personal relations,
and ascendancy were also positively correlated with student-rated
effectiveness.

Discussion 

The results of these studies contain both good news and bad news. A
lot of variables that might affect ratings don't have much effect.
This is good news in that a number of potential sources of error are thus
determined to be of little consequence and those of us reporting ratings to
instructors need not worry about constructing different sets of norms for
particular kinds of classes or particular kinds of students. The bad news
is that my hope that these correlates would lead to new theoretical insights
is also not supported. Intuitively, one feels that one needs to separate
the effect of the teacher as he teaches in the classroom from that of the
teacher as a person. Each of these must make some impact upon the student
and in turn upon his ratings. They must have some differential effect on
different types of students. My faith in the usefulness of such detailed
analysis remains despite, not because of the richness of, our research findings
to date.

I still believe that teaching is a very complex business. Thus I
think interpretation of student ratings should be left to faculty who
understand the particular problems of a particular class and who can make
allowances for the variables which might affect student judgments.

Peers may over-weight some factors, hence our research is worthwhile
to them. But teaching is still a very human and individual endeavor and its
meaning is not easily captured by statistics.

TEACHERS WHO MAKE A DIFFERENCE

Jerry G. Gaff
California State College, Sonoma

Two basic ideas which underlie the use of student ratings are that 
systematic procedures should be used to evaluate teaching effectiveness 
and that students should play an important part in that process. These 
twin assumptions have been operationalized in the form of student ratings 
of their teachers, and the solicitation of such ratings is not at all 
uncommon these days. 

However, most teaching evaluation procedures are quite modest
efforts. Most (a) rely on student descriptions of their teachers, (b) in
a classroom, (c) for the duration of a term, (d) at the discretion of
individual faculty members. This despite the fact that it is obvious that
(a) students are but one constituency with a legitimate interest in and
perspective on the quality of teaching, (b) the classroom is only one
setting in which teaching and learning occur, one which may be becoming
decreasingly important, (c) the important consequences of an education
can be observed only over a long time span, and (d) acquisition of knowledge
about the effects of one's teaching can help all teachers learn how to
improve.

The main thrust of my comments today is that it is necessary to go
beyond this current limited use of student ratings. I am prepared to
argue that it is important to advance in three areas: in research, in
theory, and in practice.

First in regard to research. It must be acknowledged that even the
modest initial efforts to evaluate teaching have generated several useful
student rating forms, many research studies, and a number of correlates
of effective teaching. Despite these advances, however, very little is
known about the characteristics of teachers, teaching styles, and
student-teacher relationships which have demonstrable long term benefits to
students. We need to conduct research which will provide knowledge about
the kinds of teachers and teaching which make a difference in the cognitive
and affective lives of students. This kind of research probably will have
to employ methodologies beyond those which are commonplace in the study
of student ratings. I would like to illustrate the kind of research which
is needed by discussing a study which I have recently completed.

While working at the Center for Research and Development in Higher
Education at the University of California, Berkeley, I was presented with
a special opportunity to examine the impacts of faculty on students during
their entire four-year career. Longitudinal studies of student growth and
development were initiated in 1966 under the general direction of Paul Heist.
These researchers administered a set of questionnaires to students when
they entered as freshmen and again as they were preparing to graduate in
the spring of 1970. In conjunction with these studies, several colleagues
and I, who had been researching faculty members, conducted a survey of
faculty in nine of the same institutions during the spring of 1970. We
related data from 851 faculty members to 1475 students for whom complete
sets of freshman and senior questionnaires were available.

Of particular concern to all of us were certain kinds of teaching and
learning, those which are usually lumped together under the term "liberal
education." Although that term cannot be defined sharply, it manages to
imply a special kind of education which is at the heart of most college
and university endeavors. In regard to teaching, it means more than
transmitting facts and theories and more than presenting the content of
one's academic specialty, however important these may be. It implies a
breadth of concern and an attempt to relate knowledge in one's field to
other fields of investigation, to realities in the larger society, and
to the personal lives of students. Similarly, the kind of learning
which is the fruit of a liberal education transcends the acquisition of
cognitive facts, methods, or principles, as important as these may be.
It includes such affective components as acquiring an appreciation of
the value of intellectual inquiry, increasing sensitivity and awareness,
and developing a personal philosophy and outlook on life. In short, the
kind of teaching and learning in which we were interested was that which
made a difference in the lives of students.

From the mass of data which were gathered several analyses were 
conducted, but I will discuss only a couple today. One item asked 
senior students to name the faculty member who had "contributed the 
most to their educational and/or personal development" during their 
college years and to describe the ways that the teacher had helped them. 
A total of 1127 seniors, or 77 percent, named such faculty members. 
Most of the remaining students in the survey left the item blank, but a 
few wrote in colorful comments like "No such animal," disavowing that 
any faculty member had played a significant role in their development. 

Insert Table 1 about here 

As may be seen from Table 1, the vast majority of the nominated 
faculty were said to have been available and open for discussions, 
stimulated students intellectually, helped them feel confident, demanded 
high quality work, and interested students in their fields. Fewer, but 
still a majority of the influential teachers, were said to have encouraged 
students to inspect their values, given career advice, and fostered 
awareness of social issues. Only a minority counseled about a personal 
problem or helped students get a job or scholarship. Although a couple 
of these statements are descriptive of the teachers, most are descriptive 
of the ways students were helped by them. Generally, the results confirm 
that students benefited in several ways and to a considerable extent by 
the teachers named. 

However, this is only the prologue to the issue at hand, because we 
wanted to learn about the kinds of teachers who had such impacts on 
students. Seniors were also asked to name, but not to describe, the 
teacher who had taught the most "stimulating course" they had taken 
during their college careers. A total of 97 faculty members who received 
nominations from two or more students either as having contributed the most 
or as the teacher of the most stimulating course had returned faculty 
questionnaires. A total of 609 faculty who received no nominations in 
either capacity also returned questionnaires. In one analysis the responses 
to the faculty questionnaire of these two groups were contrasted. 

Similarly, faculty members were asked to name two colleagues whom 
they regarded to be "outstanding teachers" and one colleague whom they 
regarded as having "significant impact on the lives of students." Another 
analysis contrasted the questionnaire responses of the 137 faculty members 
who received two or more nominations from their colleagues with the 525 
who received no mention from any colleague. 

We first discovered that there was a fair degree of overlap between 
the faculty nominated by students and those named by colleagues. This 
overlap helps to explain why the results of the two analyses are so 
similar that they can best be discussed together. 

More importantly, we learned that there is a configuration of variables 
which differentiates the faculty who make a difference from the rest of 
their colleagues. First, faculty nominated both by students and by colleagues 
evidenced a greater commitment to undergraduate teaching than did the 
non-nominated groups. In significantly greater numbers they registered 
preferences for teaching over engaging in research and for teaching 
undergraduate students over graduate students. 

Influential teachers were also significantly more likely to talk 
with students about a variety of issues of importance and even urgency 
to young adults. In both the student- and colleague-nominated analyses, 
over 50 percent of the influential faculty scored in the top third of a 
scale concerning the frequency with which they discuss with students youth 
culture issues, such as sex and morality, the use of drugs, and alternative 
life styles, whereas less than a third of the non-nominated faculty reported 
frequent discussions of this type with students. Such "rap sessions," 
whether they occurred inside or outside the classroom, are evidence of 
the influential faculty's greater involvement with students and their greater 
concern for issues of importance to students. 

In order to sharpen the interpretation of this finding, it should be 
noted that the nominated faculty were not more liberal than their less 
influential colleagues. A variety of issues which range along a 
liberal-conservative dimension, including political preference, views 
concerning the regulation of student social life, tolerance for controversial 
activities of students and faculty, and attitudes toward student participation 
in policy-making, failed to differentiate the two groups. Further, the 
student-nominated group of teachers did not differ in age from the 
non-nominees; influential teaching was found by students to be about equally 
distributed throughout the age groups. Thus, it was not the radical young 
faculty who were regarded as influential by discussing these youth culture 
issues with students, as might have been suspected. Rather, it appears 
that a willingness on the part of a teacher to explore and analyze these 
topics with students regardless of his age or position regarding them is 
the key to being regarded a particularly influential teacher by one's 
students and colleagues. 

The single biggest difference between influential faculty and their 
colleagues was the extent to which they interacted with students outside 
the classroom. Faculty respondents were asked to indicate how many 
times they had out-of-class discussions with students in several areas 
ranging from course work to personal problems. Fifty-four percent of 
the student-nominated faculty scored high on the scale of frequency of 
such interaction compared with 30 percent of those who received no nominations; 
comparable figures for the colleague-nominated group were 55 and 26 percent. 
Perhaps encounters which take place outside of class provide greater 
opportunities for students and teachers to carry on discussions which 
focus on student concerns than the more formal student-faculty 
relationships which are found in the classroom. At any rate, these data indicate 
that much effective teaching can be found in settings beyond the classroom. 

If making a difference with students can be thought of as constituting 
its own reward, then influential teachers would appear to reap a greater 
sense of accomplishment from their teaching efforts. Forty-four percent 
of the student-nominated faculty scored high on a scale of self-perceived 
influence which measured the extent to which faculty thought they had an 
impact on students' personal philosophies, decisions about careers and 
major fields of specialization, and appreciation of the values and methods 
of scholarly inquiry; only 27 percent of the non-nominated faculty felt 
they had as much influence on students generally. Comparable differences 
were found between those faculty receiving two or more nominations from 
their colleagues and those receiving none. Similarly, over two-thirds 
of each group of the influential teachers named a senior to whose 
educational or personal development they felt they had contributed a great 
deal, which was considerably more than the non-nominated groups. Given 
that non-nominated faculty had much less contact with students outside of 
class, it may be that they often did not know their students well enough 
to assess their own impact on them. 

One finding that is particularly relevant to the concerns of this 
conference is that the nominated faculty generally were not distinguishable 
from their non-nominated colleagues on the basis of their classroom teaching 
styles. Thirty-two items descriptive of classroom teaching styles 
were included in the faculty questionnaire. Most of them were taken 
from the well-developed and validated student rating scale constructed by 
Hildebrand and Wilson and were modified so that faculty could describe 
their own teaching behavior. Reliable scales were developed to measure 
the extent to which faculty encouraged students to participate in the 
course, classes were well organized, teachers adopted a relaxed, discursive 
style, and faculty attempted to make their presentations interesting. 
Only the last scale yielded statistically significant differences between 
nominated and non-nominated groups, and those differences were so small 
as to be educationally insignificant. 

Here then is an interesting anomaly. Hildebrand and Wilson have 
developed one of the best student rating scales around; they have conducted 
research which demonstrates that its five scales consistently discriminate 
between effective and ineffective classroom teachers; but items borrowed 
from that instrument failed to differentiate between the teaching practices 
of faculty who make the greatest difference in the lives of students and 
their less influential colleagues. 

How may this finding be explained? Of course, it may well be 
that the items were changed in meaning when they were modified for use 
with the faculty, and it is clear that the scales derived from the faculty 
data are not directly comparable to those derived from student data. But 
I am bothered by another possibility, the different contexts of the 
studies. Research into student ratings is generally conducted within 
the framework of a single course. So far as learning the subject matter 
of a course is concerned, the degree of organization, for example, may 
be a significant teaching factor. However, so far as making a difference 
in the lives of students is concerned, the degree of teacher organization 
would be trivial. Although it is by no means conclusive, this finding 
suggests to me that what goes on within individual classrooms may have 
little relevance for the long-term liberal education of students. If 
this is so, a research procedure designed to identify the correlates 
of effective teaching within the context of conventional academic courses 
may systematically fail to identify the kind of teaching which facilitates 
a liberating education. 

There are many other analyses which I would like to share with you, 
but since time is lacking, you might find the essence of the study useful. 
The general conclusion of our study is that on most campuses there are 
important barriers to significant encounters between students and teachers. 
Those teachers are most influential who find ways to transcend the barriers 
of age and authority, classroom and content, to confront students where 
they are. Although I have not discussed it today, those students are more 
effective in reaping the benefits of a liberal education who more aggressively 
use the learning resources of the school, including their teachers, to 
expand their understanding and awareness. And those schools are most 
potent which, whatever the content of their formal curricula, create the 
conditions for casual, frequent, continuous, and wide-ranging interactions 
between students and teachers which extend beyond the classroom. 

I hope this brief description of a portion of one research effort 
will illustrate my major point that more research needs to be directed 
at the kinds of teaching which are associated with long-term beneficial 
effects on students. There are many kinds of teaching and many kinds 
of learning, and there is a need to learn about the qualities of teachers 
who make a difference in the cognitive and affective aspects of students. 

You will recall that I said we need to go beyond the current state 
of the art of student ratings in research, in theory, and in practice. 
If the discussion about research was rather lengthy, the issue about 
theory may be handled with dispatch. The simple fact is that we lack 
an adequate theory of instruction. Research has identified various 
kinds of effective teaching and, as we have heard, several of its 
correlates. But we are not at all sure how the instructional behavior 
of teachers relates to learning by students. Given this lack of 
understanding, it is uncertain how the behavior of teachers may be modified 
to increase the amount of student learning. 

The lack of adequate theorizing is particularly apparent in the 
area of student ratings. So far as I know, there is no theory which 
relates student ratings to instructor behavior, to changes in instructor 
behavior, or to student cognitive and affective growth. These theoretical 
questions must be addressed: 

1. How are the results of student ratings conceived and interpreted 
by teachers? Do faculty selectively perceive the results, and if so, 
what needs does that process serve? 

2. How do faculty members respond to the positive and negative 
results of students' ratings? How do ratings affect their self-concepts, 
as persons and as teachers? 

3. How do ratings affect faculty motivations to change their teaching 
behavior? Do negative ratings generate anxiety or other defensive reactions 
which impede change, or do they generate a genuine desire to improve the 
quality of teaching? 

4. How do changes in teaching behavior affect students? Do 
the students perceive changes in their teachers, how do they respond to 
those changes, and do they learn more? 

Unless we are able to improve our theorizing about the role of 
student ratings in the teaching-learning process, I see little hope that 
we can use them to help teachers make a greater difference. 

The third area I want to comment on is current practice concerning 
student ratings. Even though we lack the desirable knowledge base in 
research and theory, the state of the art of student ratings is sufficiently 
advanced that we may go beyond the usual current practice. After all, 
most decisions we face in life must be made with only incomplete knowledge, 
and on the basis of my own impressions I will suggest a few guidelines 
for implementing a more comprehensive teaching evaluation procedure. 

1. A formal system of teaching evaluation should be established for 
all faculty members simply because it may provide the best available 
knowledge about the consequences of teaching, and because such knowledge 
is necessary if faculty members are to improve their performance. In 
order to make the evaluation the integral part of the instructional 
process it deserves to be, it should be possible to place the responsibility 
for obtaining reliable knowledge about his teaching on each faculty member. 

2. Although student ratings are better than less systematic attempts 
to learn of student reactions to teachers, and although the evidence 
indicates considerable overlap between student and faculty judgments, 
the student viewpoint is only one which needs to be considered. A more 
comprehensive teaching evaluation procedure which solicits appropriate 
inputs from students, the teacher himself, his teaching colleagues, and 
administrators, all of whom have legitimate interests in the quality 
of instruction, would seem to be a more desirable procedure. 

3. Evidence about the classroom performance of teachers is important 
but not sufficient, as the research data I have discussed indicate. It 
is particularly important to learn about how faculty interact with students 
beyond the classroom. Indeed, recent years have seen the advent of a 
number of new settings for teaching (independent study, community action 
projects, work-study programs, experiential learning, external degree 
programs) in which the traditional classroom plays a more limited role. 
Evidence about the kinds of teaching which occur in these expanded contexts 
must also be taken into consideration in teaching evaluation procedures. 

4. Rather than a one-shot affair, teaching evaluations should be 
conducted on a continuous basis. A regular and continuous procedure 
would identify the degree of progress, stability, or even regression in 
performance and point the way for various actions which might assist 
each person to achieve to his fullest. 

5. Although it is useful for faculty members to learn of the 
results of their own evaluations, it is more useful for them to learn 
about their own evaluation in comparison with the evaluations of others. 
Insofar as possible, teaching evaluation should be conducted on a 
comparative basis. 

6. Some individuals may make incorrect interpretations of their 
assessments, because they are not sophisticated in reading such data or 
their feelings may interfere with their understanding. For these 
reasons it may be desirable to build in follow-up procedures in which 
the results of teaching evaluation may be discussed, interpreted, and 
implications for changes (if any) drawn with the teacher. Such counseling 
would obviously be a delicate matter, but it can be used to assist 
teachers in making good use of the assessment data. 

7. One of the stickiest issues in evaluation concerns the 
use to which the results are put. It seems to me that the most important 
use is for them to be linked together with a faculty development program. 
A full-fledged faculty development program would be designed to assist 
individual faculty members to develop to their fullest both professionally 
and personally. There ought to be a variety of resources available at 
an institution, including opportunities for micro-teaching, learning about 
new techniques of teaching and learning, and the like, to help faculty 
become more effective persons and teachers. Teaching evaluations could 
be used to help identify problems which could be aided by means of a 
comprehensive faculty development program. 

8. The results of teaching evaluation ought to be used, also, to 
make decisions about retention, promotion, and tenure. It is in the 
self-interest of the institution, and the entire professoriate, to 
retain, promote, and award tenure to those persons who are adjudged by 
the best available evidence to be effective teachers. This is especially 
true today when we have an abundance of prospective teachers for each 
open position; unlike the days of a teacher shortage, there is little 
justification for rewarding ineffective teaching any more. 

9. Most teaching evaluation procedures attempt to learn how well 
individual teachers are performing within the general university structure. 
Yet, we know that individuals are severely constrained by their environments; 
the institutional climate, faculty value scheme, peer group pressures, 
and institutional organization all impose limitations on the effectiveness 
of any individual. Further, teaching may be significantly improved by 
modifying the environment within which a faculty member teaches. Thus, 
innovations such as cluster colleges, offering alternative educational 
environments, should be encouraged with vigor at least equal to that 
propelling teaching evaluation. 

10. A few schools have decided that they can best respond to the 
need to improve instruction by creating teaching resource centers. 
Although such centers vary in size, structure, and program, they all 
provide some of the services discussed earlier to help faculty members 
improve their teaching. Because there will be few additional faculty 
positions at most schools in the foreseeable future, an increasing need 
will be to help the existing faculty to grow and develop as teachers. 
For this reason I think we can and should look forward to these offices 
becoming the newest entries on the organization charts of many institutions. 

It is my conviction that the new directions in research and theory 
I have suggested will allow us to better understand the complicated 
dynamics of teachers who make a difference with students and that the 
suggestions for going beyond the current use of student ratings in practice 
will allow faculty members to make a greater impact in the education of 
students. 

TABLE 1 

STUDENT PERCEPTIONS OF THE WAYS 
INFLUENTIAL FACULTY MEMBERS HELPED THEM 
(In Percentages) 
(N = 1127) 

                                   Not at all    Somewhat      Quite        Very
                                   descriptive  descriptive  descriptive  descriptive
STATEMENT                              (1)          (2)          (3)          (4)

He or she: 

Was available and open to any 
  discussion                            4           17           30           51
Stimulated me intellectually            3           16           35           46
Helped me feel confident 
  of my own abilities                   9           18           35           37
Demanded high quality work 
  from me                              11           19           32           37
Interested me in his/her field         10           24           31           35
Encouraged me to inspect my 
  values                               31           25           26           17
Advised me about my career 
  plans                                31           31           22           16
Made me aware of social issues         36           31           21           13
Counseled me about a personal 
  problem                              59           22            9           10
Helped me get a job or 
  scholarship                          71           12            8           10

The major focus of this research report is on the stresses faculty 
members feel as they conduct their work and the relationship between 
these conflicts or pressures and their performance as classroom teachers. 
The performance measures used in the study are ratings of teaching 
effectiveness by faculty colleagues and also ratings of teaching 
effectiveness by students in each professor's classes. Therefore, a 
second focus will be upon the extent to which student and faculty 
raters agree about the teaching effectiveness of professors under 
different conditions of stress and with various personal characteristics. 

These data are part of a larger study designed to apply the 
propositions of role conflict theory and organizational stress to 
the workings of a small baccalaureate college. The basic notions of 
this theoretical framework are best presented in diagram form. (See 
Figure 1.) 



The conceptual framework for the study comes from work on role 
sets and role conflict by Robert Kahn and colleagues (1964) in relation 
to studies of personal health in organizations. In their theoretical 
model, both personal characteristics and the organizational environment 
directly affect outcome variables (e.g., performance on the job, or 
satisfaction). Additionally, an interaction between the individual and 
the organization takes place as the person works in the job environment. 



Insert Figure 1 about here 

This fit between the person and the organization, a created psychological 
environment, also directly affects outcomes. 

Stresses or conflicts in this situation can take many forms, 
but one of the most common reactions to heavy or ambiguous demands of 
the job is to feel unduly pressured and loaded down. The psychological 
environment, or the fit of the person and the organization, will 
moderate this reaction; some people respond to heavy work demands more 
quickly or more negatively than others. But, in general, when a focal 
person says he feels highly overloaded, it is like saying that he 
feels the pressures are beyond his particular inclination or capacity 
to cope with them effectively. The central hypothesis of this study 
is that a person's responses to the stress of role overload will be 
detrimental to role performance, and that the extent of this effect 
will be moderated by the enduring personal properties of the person. 

Two forms of role overload are selected for primary attention. 
The first is quantitative (QT) overload, or the discrepancy the 
individual feels between job requirements and the time available to 
accomplish them. With professionals, such as faculty members, this 
time factor is concerned with preferred use of time as well as with 
the actual number of hours available. The other factor is qualitative 
(QL) overload, the discrepancy between the demands of the job and the 
person's sense of being able to meet the demands irrespective of time. 

Both quantitative and qualitative overload are expected to lead 
to impaired job performance, although through somewhat different 
mechanisms. Quantitative overload, by definition, means the person feels he 
cannot perform his job in the way expected by all of his role senders 
because there is too much work for him to do in the time available. 
Therefore, their evaluations of his performance are likely to suffer. 
But if work demands conflict with self-attributed lack of ability or 
skill, leading to qualitative overload, the effect may be most 
apparent in a lowered level of job attention and satisfaction. These 
conditions, in turn, may contribute to lower evaluations by others. 

A high level of experienced overload in one area can reasonably 
be expected to increase the level of felt pressure in the other area. 
For instance, concern about one's ability to perform the work 
(contributing to high qualitative overload) probably increases susceptibility 
to feelings of pressure from lack of time (quantitative overload) and 
may lead to substandard performance. Or too much work to do (high 
quantitative overload) might contribute to concern about succeeding 
professionally which would be reflected in feelings of high qualitative 
overload. Thus, though quantitative overload and qualitative overload 
are conceptually distinct, they are related, and a high level of 
either one is expected to affect role performance. A low positive 
correlation between measurements of quantitative and qualitative 
overload is expected, and both are expected to correlate negatively 
with job satisfaction and with independent ratings of job performance. 

In addition to the direct effects on work performance diagrammed 
in Figure 1, this research specifically hypothesized that traits of 
the person will moderate the relationship between stress and performance. 
In terms of Figure 1, this hypothesis states that enduring personal 
characteristics such as level of emotional sensitivity or tendency 
toward sociability (arrow 1) interact with the conflicts and stresses 
experienced in the work situation (arrow 5) to demonstrate relationships 
with work performance that are different from the direct effects of 
either set of variables considered separately. Examples of predictions 
from this conception are that high stress will be most damaging 
to the work performance of faculty members who have a high level of 
emotional sensitivity, or for those professors who tend toward 
social independence rather than sociability. This hypothesis proposes 
that consideration of personal factors along with level of stress 
will improve our understanding of the relationship between stress and 
performance. 



Subjects for this study were faculty members at a small liberal 
arts college that we will call "Midwest College." Forty-five 
professors, or 85 percent of all full-time faculty members, provided 
full information and are included in these results. They represent 
a variety of fields, backgrounds, and levels of academic experience. 
The principal faculty roles are teaching and participating in the 
general activities and operation of the college. Students are average 
in ability and variety of interests. In these respects, the college 
is similar to many general-purpose baccalaureate programs across the 
country. It is neither highly selective nor self-consciously open-door, 
but middle-of-the-road and, at the time these data were collected, 
relatively traditional in its view of the teaching-learning process. 

Specifically, each faculty member rated every other teacher in his 
curriculum division on a five-point scale of "teaching effectiveness." 
Raters were told to "consider those qualities which are important in 
the evaluation of the skills and practices and products of a classroom 
teacher, regardless of rank or experience or training of the person being 
rated." 

Student evaluations of teaching effectiveness were obtained 
from a standard 14-item, five-point scale questionnaire the college 
systematically employed to evaluate all courses each semester. 
Responses to the question "How would you rate your instructor in 
teaching effectiveness?" were averaged across all courses taught 
by a faculty member during the semester in which other data were 
collected. The professor's mean served as the index of his 
teaching performance as judged by students. 
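
As a rough illustration of how such an index can be computed, the sketch 
below averages responses to the effectiveness item over all of a professor's 
courses. The ratings, the function name `professor_mean`, and the decision 
to pool all student responses (rather than to average course means) are 
assumptions made for illustration; the report does not say which was done. 

```python
from statistics import mean


def professor_mean(ratings_by_course):
    """Average the 1-5 effectiveness ratings over all courses taught.

    ratings_by_course maps a course label to the list of student ratings
    for that course. Pooling all responses is an assumption of this sketch.
    """
    pooled = [r for course in ratings_by_course.values() for r in course]
    return mean(pooled)


# Hypothetical ratings for one professor in one semester.
ratings = {"course_a": [4, 5, 3], "course_b": [2, 4]}
index = professor_mean(ratings)
```

Averaging course means instead would weight small and large classes 
equally, which could give a somewhat different index. 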

Faculty members also completed questionnaires on academic 
attitudes and values, background characteristics, and personal 
traits. Thirty stress items similar to those used by Mueller 
(1965) were factor analyzed and yielded results consistent with 
the factors he obtained from responses by faculty members in a 
large, research-oriented university. The quantitative (QT) 
overload index was constructed by totaling weighted individual 
responses to the five items that loaded highest on the factor 
assigned this label. 
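
A weighted total of this kind can be sketched as follows. The responses and 
weights shown are hypothetical; the report does not give the weights that 
were actually applied, and the function name `overload_index` is introduced 
here only for illustration. 

```python
def overload_index(responses, weights):
    """Total the weighted responses to the items defining one factor."""
    if len(responses) != len(weights):
        raise ValueError("one weight is needed per item")
    return sum(w * r for w, r in zip(weights, responses))


# Hypothetical five-point responses to the five QT items, with
# hypothetical weights (the study's actual weights are not reported).
qt_responses = [4, 5, 3, 4, 2]
qt_weights = [1.0, 1.0, 0.5, 0.5, 0.5]
qt_index = overload_index(qt_responses, qt_weights)
```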



1 The method is one of using experts, in this case professional 
colleagues, to make judgments about quality. Perhaps the best 
documented recent use of this technique, at least in higher 
education, is the ACE ratings of doctoral programs (Cartter, 
1966; Roose and Anderson, 1971). See Clark & Blackburn (1973) 
for details concerning the analyses carried out to establish 
the reliability and validity of the measures used in the study 
here reported. 

2 Overwhelming workload. 
Too many things to be done. 
The feeling of never having any time. 
Not being able to allocate my time and resources as I wish to. 
Not enough time to think and contemplate. 

The four items loading highest on the factor labeled qualitative (QL) 
overload were totaled to form an index of this variable. No item 
included in one index loaded above .25 on the other factor, and most 
had alternate loadings near zero. 

Wherever possible, established measures were used to represent 
the personal attributes under study. The measure of emotional sensitivity 
or anxiety (Ax) is the total score on two subscales (22 items) of the 
I.P.A.T. anxiety scale (Cattell, 1956). The flexibility (Fx) index is 
the total score from 22 items comprising the flexibility scale on the 
California Psychological Inventory (Gough, 1957). Items used to construct 
the Self-Esteem (SE) index come from two shorter measures, one by 
Rosenberg (1965) and the other by Cobb et al. (1966). Sociability (So) 
is defined by Bass's (1967) social interaction scale in the Orientation 
Inventory. Research Orientation (Res) is more a value than a personality 
trait and probably is less enduring and stable. This index is a factor 
score over 22 items concerned with the profession of college teaching, 
the relative weight assigned to research and teaching as academic role 
obligations, and preferred teaching styles. The items loading highest 
on the factor are listed in footnote 4. 



3 The desire to succeed. 
Not measuring up to the demands of the job, lack of training or 
knowledge or talent. 
Responsibility for and control of people's futures. 
Competition to keep up with my colleagues. 

4 Research is the academic man's most important activity. 
For me, research obligations are relatively unimportant in contrast 
to teaching obligations. 
It is important for a faculty member to engage in both teaching and 
research; neither should be stressed in preference to the other. 

Questionnaire responses and institutional records also gave data 
on actual and preferred distribution of work time on differentiated 
activities, intrinsic and extrinsic job satisfaction, teaching load, 
committee assignments, and the like, as well as providing standard 
demographic data. 

Results 

For many of the analyses the respondents were divided as evenly
as possible into high and low groups on each personal attribute and
on the relevant index of experienced stress. The "high" and "low"
designations are relative terms and may or may not have any "absolute"
meaning. For example, this faculty reports an average work week of
more than 56 hours. Hence those in the "low" group are still carrying
a heavy load. Similarly, on the emotional sensitivity scale (Ax), the
total scores of the respondents range from 35 to 68 on a scale running
from 22 to 110. The group designated more anxious or emotionally
excitable, then, has a mean score well below levels associated with
serious emotional distress. In the opposite direction, self-esteem
scores range from 31 to 53 out of a possible 11 to 55. Thus, in fact,
members of the "low" self-esteem group think rather well of themselves.
Once more, high and low are relative terms used only to represent
the direction of certain factors in the data analysis.
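
The division into relative "high" and "low" groups can be illustrated with a short sketch. The paper does not state the exact rule used to divide respondents "as evenly as possible"; a split at the sample median is assumed here, and the function name and scores are hypothetical.

```python
from statistics import median

def split_high_low(scores):
    """Label each respondent 'high' or 'low' relative to the sample median.

    Assumption: a median split. Respondents scoring exactly at the median
    fall in the 'low' group, so the groups may be slightly uneven.
    """
    cut = median(scores.values())
    return {name: ("high" if value > cut else "low")
            for name, value in scores.items()}

# Hypothetical anxiety (Ax) totals on the 22-to-110 scale.
ax_scores = {"A": 35, "B": 41, "C": 47, "D": 52, "E": 60, "F": 68}
groups = split_high_low(ax_scores)
```

Because the cut point comes from the sample itself, a respondent labeled "high" here may still score low in absolute terms, which is the caution raised in the paragraph above.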

The stress measure of quantitative overload, representing a
discrepancy between time demands and individual preferences for time
allocation, demonstrated negative but very low (statistically
nonsignificant) correlations with age, years of experience, rank, and
salary. There is also little apparent association between this subjective
measure of quantitative overload and available indicators of objective
workload, such as teaching load or hours worked per week. This is
perhaps not too surprising when we note that almost all faculty members,
even division chairmen, teach 10 to 15 hours per week, serve on two
to five committees, and average 56 hours per week on the job. The basic
work situation is heavy for them all. Instead, the subjective measure
of quantitative overload seems to reflect conflict between the individual
and the work situation rather than a direct representation of objective
workload. For instance, professors with high QT overload scores also
say that they feel a lot of pressure from college assignments, regulations,
and requests for services.

Intercorrelations of the stress factors and the personal attribute 
measures are presented in Table 1. 

Insert Table 1 about here 

As predicted, there is a moderate positive relationship between QT and
QL overload (r = .36).⁵ The measure of emotional sensitivity (Ax)
also shows moderate relationships with stress from time pressure (QT),
level of flexibility, and level of self-esteem. In general, however,
the intercorrelations of these self-report variables are low, suggesting
reasonable independence in measurement as well as conception.
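
For readers unfamiliar with the statistic, the following is a minimal sketch of a product-moment correlation between two index scores. The paper does not name the coefficient it used, so Pearson's r is an assumption, and the data are hypothetical rather than taken from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical QT and QL overload index scores for five respondents.
qt = [1, 2, 3, 4, 5]
ql = [2, 1, 4, 3, 5]
r = pearson_r(qt, ql)
```

Applying the same function to every pair of indexes would fill a table of intercorrelations of the kind referred to as Table 1.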

⁵ Mueller (1965) obtained a correlation of .34 between QT and QL in his
study of university professors.

Our first hypothesis derived from the conceptual model stated
that high stress (high QT or QL overload) would negatively affect work
performance, or rated teaching effectiveness. Figure 2 diagrams mean
performance ratings by students and by faculty peers when faculty
members are divided into low and high groups on QT and QL overload.
Though student ratings are somewhat lower for faculty members who report
that they feel a lot of pressure and conflict, the differences are not
statistically significant and we must reject the hypothesis. If we
consider only experienced stress, there seems to be little effect on
the teacher's work performance.

Insert Figure 2 about here

Our last results concern the interaction of enduring personal
traits and experienced conflict on performance. For these analyses,
faculty members were divided high and low on each stress variable and
high and low on each personal characteristic. Mean teaching
effectiveness ratings by students and by faculty peers were calculated
for each of the four cells. Figure 3 diagrams mean performance scores
for low and high QT overload and low and high classifications on each
of the five personal dimensions.
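
The four-cell layout described above can be sketched as follows. This is an illustration only: the records, labels, and function name are hypothetical, and the ratings are not data from the study.

```python
from collections import defaultdict
from statistics import mean

def cell_means(records):
    """Mean rating for each (stress level, trait level) cell of a 2 x 2 design."""
    cells = defaultdict(list)
    for stress, trait, rating in records:
        cells[(stress, trait)].append(rating)
    return {cell: mean(values) for cell, values in cells.items()}

# Hypothetical records: (QT overload group, self-esteem group, student rating).
records = [
    ("low", "low", 3.0), ("low", "low", 3.2),
    ("low", "high", 4.0), ("low", "high", 4.4),
    ("high", "low", 3.1), ("high", "low", 3.3),
    ("high", "high", 3.4), ("high", "high", 3.0),
]
means = cell_means(records)

# Change in mean rating from low to high stress, within each trait group.
change_high_trait = means[("high", "high")] - means[("low", "high")]
change_low_trait = means[("high", "low")] - means[("low", "low")]
```

A moderating effect of the kind the paper discusses would appear as a difference between the two change scores: in these made-up numbers the high-trait group drops by about one point under high stress while the low-trait group barely moves.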

Insert Figure 3 about here

As can be seen in Figure 3, students and faculty have highly
similar patterns of assessment concerning faculty teaching effectiveness.
That is, there is general agreement on relatively higher or lower ratings
as well as on the effects of stress and the moderation of this effect by
personal characteristics. This finding is in accord with correlations above
.60 between student and faculty assessments of teaching as reported by
Maslow and Zimmerman (1956) and Choy (1970).

Critics of rating procedures for measuring teacher performance
often question whether faculty members can (or will) discriminate among
their colleagues on this dimension, suggesting that the results are
likely to look very flat and uninteresting. Inspection of Figure 3
suggests that this is not the case. Though student ratings tend to
exhibit slightly greater variability, and therefore reach levels of
statistical significance somewhat more often in these data, faculty
colleagues show marked and consistent differences in evaluations of
teaching among their peers in relation to two separate indexes of
stress.

On teaching effectiveness, students are rating faculty members
they have observed in the classroom over the course of a semester;
faculty members are rating colleagues in the same curricular division
with whom they interact in various professional ways, but generally
do not directly observe in the classroom. Factors of low and high
stress and low and high personal traits enter into the ratings only
insofar as they affect the rater's perception of the effectiveness of
the faculty member's teaching. Given the independence of the ratings
and the personal variables, there is remarkable consistency between
faculty members and students across the five personal conditions. Both
sets of raters agree that under high quantitative overload, an otherwise
high level of teaching effectiveness definitely drops among faculty
members who are more emotionally excitable, are more rigid, have
higher self-esteem, are more independent, and have a higher research
orientation. But high QT overload has little apparent effect on the
initially lower effectiveness of teachers who are calm, are more flexible,
have lower self-esteem, are more sociable, and are more orientated
toward teaching than research.

The effects of qualitative overload (Figure 4) on teaching
effectiveness are generally similar, though higher ratings on teaching
effectiveness under some high stress conditions are apparent. Again,
independent ratings by faculty members and students are very much alike.
It appears that high overload stress, whether time or ability related,
is harmful to the teaching effectiveness of the kinds of faculty members
who tend to get the highest teacher ratings under low overload conditions,
but is not particularly harmful (and may even be beneficial) to the
teaching of those who receive the lower ratings when stress is low.

Insert Figure 4 about here

Though at first these results for the lower rated teachers appear
to be contradictory, they are consistent with the notion of involvement
or "creative tension" (Pelz, 1967) as a prerequisite to top-level work
among independent professionals working in organizational settings. It
could be argued that less excitable, more flexible, more sociable, and
more teaching-orientated faculty members are adequate as teachers under
conditions of low stress, and they continue to perform at about the
same level when they are pushed hard, either qualitatively or
quantitatively. In fact, they may even do better as they respond to the
challenge. However, their counterparts fall apart under high pressure,
particularly when it is time pressure, and their teaching suffers. Already
maximally involved under conditions of low stress, the additional
pressure can only be disruptive. These kinds of teachers get the highest
ratings when they are not too pressured. But, with high pressure, they
cannot keep up with the demands and their work suffers.

Both students and faculty give highest teacher ratings to faculty
members who have a high research orientation, are socially independent,
have a high value of self, are comparatively rigid, and who also do not
feel much sense of role overload, either quantitative or qualitative.
(See Figures 3 and 4.) Apparently these are the people who thrive on
pressure, or who are self-sufficient enough to be relatively oblivious
to it. For it is exactly these same kinds of people who are rated much
lower in their teaching performance when they also express a high level
of work overload.

In summary, these data support the association of role overload
stress and performance as moderated by personal traits and values.
Some kinds of people are bothered by feelings of pressure while others
are less affected or even seem to be challenged by the same condition.
However, we should note that even though the performance ratings of
some people may actually be higher under high stress, the job
satisfactions of these people suffer most under these same high stress
conditions. Therefore, stress under any condition carries with it some
penalty, though some effects will be reflected most directly in the
immediate performance of one's job.
Conclusions 

The findings have immediate and telling implications for the
managing of colleges and universities and for the people who work in
them. Faculty recruitment and retention, work assignment and load,
the reward structure of recognition, tenure, and promotion, all need
to take into account how performance is affected by stress and moderated
by personal characteristics.

For example, students rated the more rigid faculty members under
low overload stress as their most effective teachers. Colleagues too
valued conformity in relation to ratings of teaching effectiveness.
Presumably, when not under particular pressure, the more rigid faculty
member is better organized and prepared while also sufficiently relaxed
in class to be viewed as a good teacher. But, when things get tight,
he tends to get dogmatic and flustered, and his teaching performance
deteriorates. How he is viewed on a teacher evaluation form, then,
will depend in part on his other work and life circumstances. There
are two major implications for interpretation of his ratings. First,
a pattern of ratings rather than ratings at any one time should be used
in any decision-making situation. Second, ratings should be interpreted
in the context of other information about the individual.

Three other illustrations point up implications. First, high
overload appears to be detrimental to performance among those least
able to cope with stress: the excitable, the least flexible, the socially
more isolated, and the strongly research oriented. Second, faculty who
suffer most under high overload are the individuals least able to deal
constructively with frustration and discouragement, the persons for
whom increased anxiety from poor evaluations by students and peers
(together with heavy work pressures) is apt to be most counter-productive.
More rigid and more socially independent faculty are apt to withdraw
further into themselves under increasing pressure. For the research
oriented, evaluation in teaching and service becomes increasingly
frustrating because these are the areas of least important personal
professional concern.

Third, the findings raise questions regarding a growing student
practice: making faculty teaching evaluations public. The student
argument is persuasive. As clients they are entitled to full market
information. Consumer reports on faculty provide a basis on which
students take or do not take courses. The student concern for improving
teaching on campus is genuine. So is their belief that publicly identifying
weaker teachers will produce improvement. However, their technique
assumes all faculty could teach better if they would only try harder
and work at it more. Maybe they can, although Hildebrand (1972) has
found that the best and worst judged teachers give equal time to the
activity. The personality data in this study and the consequences of
stress suggest that for some faculty public ratings will have consequences
just the opposite from what is desired.






References 

Bass, B. M. Social behavior and the orientation inventory. Psychological
Bulletin, 1967, 68, 4, 260-292.

Cartter, A. M. An Assessment of Quality in Graduate Education. Washington,
D. C.: American Council on Education, 1966.

Cattell, R. B. Handbook: The I.P.A.T. anxiety scale. Champaign,
Illinois: Institute for Personality and Ability Testing, 1956.

Choy, C. The relationship of college teacher effectiveness to conceptual
systems orientation and perceptual orientation. Unpublished Ph.D.
dissertation, Colorado State College, 1969.

Clark, M. J. A study of organizational stress and professional performance
of faculty members in a small four-year college. Ph.D. dissertation in
progress, The University of Michigan, 1973.

Clark, M. J., & Blackburn, R. T. Assessing faculty performance: A test
of method. Submitted for publication, 1973.

Cobb, S., Brooks, G. H., Kasl, S. V., & Connelly, W. B. The health of
people changing jobs: A description of a longitudinal study. American
Journal of Public Health, 1966, 56, 1476-1481.

Gough, H. G. Manual for the California Psychological Inventory. Palo Alto,
California: Consulting Psychologists Press, Inc., 1957.

Hildebrand, M. How to recommend promotion for a mediocre teacher without
actually lying. Journal of Higher Education, 1972, 42, 1, 44-62.

Kahn, R. L., Wolfe, D. M., Quinn, R. P., Snoek, D. J., & Rosenthal, R. A.
Organizational Stress: Studies in Role Conflict and Ambiguity.
New York: John Wiley and Sons, Inc., 1964.

Maslow, A. H., & Zimmerman, W. College teaching ability, activity, and
personality. Journal of Educational Psychology, 1956, 47, March,
185-189.

Mueller, I. Workload of university professors. Unpublished doctoral
dissertation, University of Michigan, 1965.

Pelz, D. C. Creative tensions in the research and development climate.
Science, 1967, 157, July 14, 160-165.

Roose, K. D., & Anderson, D. J. A Rating of Graduate Programs.
Washington, D. C.: American Council on Education, 1970.

Rosenberg, M. Society and the Adolescent Self-image. Princeton University
Press, 1965.




Table 1

Intercorrelations of Stress and Personal Measures

Figure 2

Direct Effect of Stresses on Student and Faculty Ratings
of Teaching Effectiveness

Figure 3

Rated Teaching Effectiveness and Quantitative Overload

Figure 4

Rated Teaching Effectiveness and Qualitative Overload